About This Manual


Preface 


Read This First 


This user’s guide serves as an applications reference book for the 
TMS320C40 and TMS320C44 digital signal processors (DSP). Throughout 
the book, all references to the TMS320C4x apply to both devices (exceptions 
are noted). 


Specifically, this book complements the TMS320C4x User’s Guide by providing
information to assist managers and hardware/software engineers in
application development. It includes example code and hardware connections
for various applications.


The guide shows how to use the instruction set, the architecture, and the ’C4x 
interface. It presents examples for frequently used applications and discusses 
more involved examples and applications. It also defines the principles in- 
volved in many applications and gives the corresponding assembly language 
code for instructional purposes and for immediate use. Whenever the detailed 
explanation of the underlying theory is too extensive to be included in this 
manual, appropriate references are given for further information. 


How to Use This Manual 


The following table summarizes the information contained in this user’s guide: 


If you are looking for
information about:      Turn to these chapters:

Arithmetic              Chapter 3, Logical and Arithmetic Operations
Communication Ports     Chapter 8, Using the Communication Ports
Companding              Chapter 6, Applications-Oriented Operations
Development Support     Chapter 10, Development Support and Part Order Information
DMA Coprocessor         Chapter 7, Programming the DMA Coprocessor
FFTs                    Chapter 6, Applications-Oriented Operations
Filters                 Chapter 6, Applications-Oriented Operations
Ordering Parts          Chapter 10, Development Support and Part Order Information
Repeat Modes            Chapter 2, Program Control
Reset                   Chapter 1, Processor Initialization
Stacks                  Chapter 2, Program Control
Tips                    Chapter 5, Programming Tips
Wait States             Chapter 4, Memory Interfacing
XDS510 Emulator         Chapter 11, XDS510 Emulator Design Considerations


Style and Symbol Conventions


This document uses the following conventions: 


- Program listings, program examples, file names, and symbol names are
  shown in a special font. Examples use a bold version of the special font
  for emphasis. Here is a sample program listing segment:

      *
      LOOP1  RPTB   MAX
             CMPF   *AR0,R0         ;Compare number to the maximum
      MAX    LDFLT  *AR0,R0         ;If greater, this is a new max
             B      NEXT
      LOOP2  RPTB   MIN
             CMPF   *AR0++(1),R0    ;Compare number to the minimum
      MIN    LDFLT  *--AR0(1),R0    ;If smaller, this is new minimum
      NEXT

- Throughout this book MSB indicates the most significant bit and LSB
  indicates the least significant bit. MS indicates the most significant
  byte and LS indicates the least significant byte.


Information About Cautions and Warnings 


This book may contain cautions and warnings. 


This is an example of a caution statement. 


A caution statement describes a situation that could potentially 
damage your software or equipment. 


This is an example of a warning statement. 


A warning statement describes a situation that could potentially 
cause harm to you. 


The information in a caution or a warning is provided for your protection. 
Please read each caution and warning carefully. 


Related Documentation From Texas Instruments 




The following books describe the TMS320 floating-point devices and related 
support tools. To obtain a copy of any of these TI documents, call the Texas 
Instruments Literature Response Center at (800) 477-8924. When ordering, 
please identify the book by its title and literature number. 


TMS320C4x User’s Guide (literature number SPRU063) describes the ’C4x
32-bit floating-point processor, developed for digital signal processing as 
well as parallel processing applications. Covered are its architecture, in- 
ternal register structure, instruction set, pipeline, specifications, and op- 
eration of its six DMA channels and six communication ports. 


TMS320C4x Parallel Processing Development System Technical Reference
(literature number SPRU075) describes the TMS320C4x parallel
processing system, a system with four ’C4xs with shared and distributed
memory.


Parallel Processing with the TMS320C4x (literature number SPRA031) de- 
scribes parallel processing and how the ’C4x can be used in parallel pro- 
cessing. Also provides sample parallel processing applications. 


TMS320C3x/C4x Assembly Language Tools User’s Guide (literature 
number SPRU035) describes the assembly language tools (assembler, 
linker, and other tools used to develop assembly language code), 
assembler directives, macros, common object file format, and symbolic 
debugging directives for the ’C3x and ’C4x generations of devices.


TMS320 Floating-Point DSP Optimizing C Compiler User’s Guide (litera- 
ture number SPRU034) describes the TMS320 floating-point C compiler. 
This C compiler accepts ANSI standard C source code and produces 
TMS320 assembly language source code for the ’C3x and ’C4x genera- 
tions of devices. 


TMS320C4x C Source Debugger User’s Guide (literature number
SPRU054) tells you how to invoke the ’C4x emulator and simulator
versions of the C source debugger interface. This book discusses various
aspects of the debugger interface, including window management, com- 
mand entry, code execution, data management, and breakpoints. It also 
includes a tutorial that introduces basic debugger functionality. 


TMS320C4x Technical Brief (literature number SPRU076) gives a
condensed overview of the ’C4x DSP and its development tools. It also lists
TMS320C4x third parties.


TMS320 Family Development Support Reference Guide (literature number 
SPRU011) describes the 320 family of digital signal processors and the
various products that support it. This includes code-generation tools 
(compilers, assemblers, linkers, etc.) and system integration and debug 
tools (simulators, emulators, evaluation modules, etc.). This book also 
lists related documentation, outlines seminars and the university pro- 
gram, and gives factory repair and exchange information. 

TMS320 Third-Party Support Reference Guide (literature number 
SPRU052) alphabetically lists over 100 third parties that supply various 
products that serve the family of 320 digital signal processors—software 
and hardware development tools, speech recognition, image process- 
ing, noise cancellation, modems, etc. 

TMS320 DSP Designer’s Notebook: Volume 1 (literature number 
SPRT125) presents solutions to common design problems using ’C2x, 
’C3x, ’C4x, ’C5x, and other TI DSPs.


Trademarks


MS is a registered trademark of Microsoft Corp. 
MS-Windows is a registered trademark of Microsoft Corp. 


MS-DOS is a registered trademark of Microsoft Corp. 


OS/2 is a trademark of International Business Machines Corp. 


Sun and SPARC are trademarks of Sun Microsystems, Inc. 


VAX and VMS are trademarks of Digital Equipment Corp. 
Chapter 1


Processor Initialization 


Before you execute a DSP algorithm, it is necessary to initialize the processor.
Initialization brings the processor to a known state. Generally, initialization 
takes place any time after the processor is reset. This chapter reviews the con- 
cepts explained in the user’s guide and provides examples. 


Topic

1.1 Reset Process
1.2 Reset Signal Generation
1.3 Multiprocessing System Reset Considerations
1.4 How to Initialize the Processor



1.1 Reset Process 


After RESET is applied, the ’C4x jumps to the address stored in the reset vec- 
tor location and starts execution from that point. 


In order to reset the ’C4x correctly, you need to comply with several hardware 
and software requirements: 


- Select the reset vector location:

  - The RESET vector of the ’C4x can be mapped to one of four different
    locations that are controlled by the value of the RESETLOC(1,0) pins
    at RESET. Table 1-1 shows possible reset vectors for the ’C40 and
    ’C44.

  - If the DSP is in microcomputer mode (ROMEN pin = 1), RESETLOC(1,0)
    must be equal to 0,0 for the boot loader to operate correctly.

- If the DSP is in microcomputer mode, set the IIOFx pins as discussed in
  the boot loader chapter of the TMS320C4x User’s Guide so that the boot
  loader works properly.

- Provide the correct reset vector value:

  - The RESET vector normally contains the address of the system
    initialization routine.

  - In microcomputer mode, the reset vector is initialized automatically
    by the processor to point to the beginning of the on-chip boot loader
    code. No user action is required.

  - In microprocessor mode, the reset vector is typically stored in an
    EPROM. Example 1-1 shows how you can initialize that vector.

- Apply a low level to the RESET input (see Section 1.2).


Table 1-1. RESET Vector Locations in the ’C40 and ’C44

    Value at RESETLOCx Pin        Get Reset Vector From
    RESETLOC1    RESETLOC0        Hex Memory Address    Bus
    0            0                00000 0000            Local
    0            1                07FFF FFFF†           Local
    1            0                08000 0000†           Global
    1            1                0FFFF FFFF†           Global

† This corresponds to the 32-bit address that the processor accesses. However, in the ’C44 only
the 24 LSBs of the reset address are driven on pins A0-A23 and pins LA0-LA23. The
corresponding LSTRBx pins are also activated.
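
The mapping in Table 1-1 can be written as a small lookup, which may be handy in host-side tools that need to know where a board will fetch its reset vector. The following C fragment is only a sketch: the names `c4x_bus_t`, `reset_vector_t`, and `reset_vector_location` are hypothetical and are not part of any TI library.

```c
#include <stdint.h>

/* Hypothetical types for illustration only. */
typedef enum { BUS_LOCAL, BUS_GLOBAL } c4x_bus_t;

typedef struct {
    uint32_t  address;  /* 32-bit address the processor accesses */
    c4x_bus_t bus;      /* bus on which the reset vector is fetched */
} reset_vector_t;

/* Return the Table 1-1 entry for the given RESETLOC1 and RESETLOC0 levels. */
reset_vector_t reset_vector_location(unsigned resetloc1, unsigned resetloc0)
{
    static const reset_vector_t table[4] = {
        { 0x00000000u, BUS_LOCAL  },  /* RESETLOC(1,0) = 0,0 */
        { 0x7FFFFFFFu, BUS_LOCAL  },  /* RESETLOC(1,0) = 0,1 */
        { 0x80000000u, BUS_GLOBAL },  /* RESETLOC(1,0) = 1,0 */
        { 0xFFFFFFFFu, BUS_GLOBAL },  /* RESETLOC(1,0) = 1,1 */
    };
    unsigned index = ((resetloc1 & 1u) << 1) | (resetloc0 & 1u);
    return table[index];
}
```

For example, `reset_vector_location(1, 0)` returns address 8000 0000h on the global bus, which is the configuration used later in Example 1-1.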




1.2 Reset Signal Generation 


Several aspects of ’C4x system hardware design are critical to overall system 
operation. One such aspect is reset signal generation. 


The reset input controls initialization of internal ’C4x logic and execution of the 
system initialization software. For proper system initialization, the RESET sig- 
nal must be applied for at least ten H1 cycles, that is, 400 ns for a ’C4x operat- 
ing at 50 MHz. Upon power up, however, it can take 20 ms or more before the 
system oscillator reaches a stable operating state. Therefore, the power-up 
reset circuit should generate a low pulse on the RESET pin for 100 to 200 ms. 
Once a proper reset pulse has been applied, the processor fetches the reset
vector from the location selected by the RESETLOC pins (see Table 1-1); the
vector contains the address of the system initialization routine. Figure 1-1
shows a circuit that will generate an appropriate power-up or push-button
reset signal.
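
The ten-cycle minimum quoted above can be checked numerically. The sketch below assumes that one H1 cycle lasts two CLKIN periods, which is what the 400 ns figure at 50 MHz implies; the function name `min_reset_pulse_ns` is hypothetical.

```c
/* Minimum RESET low time in nanoseconds for a given CLKIN frequency in MHz.
 * Assumption: one H1 cycle equals two CLKIN periods. */
double min_reset_pulse_ns(double clkin_mhz)
{
    double clkin_period_ns = 1000.0 / clkin_mhz;   /* 20 ns at 50 MHz */
    double h1_period_ns    = 2.0 * clkin_period_ns; /* 40 ns at 50 MHz */
    return 10.0 * h1_period_ns;                     /* ten H1 cycles */
}
```

This lower bound matters only for a warm reset with a stable clock. At power-up, the oscillator start-up time dominates, which is why the circuit is designed for a pulse of 100 to 200 ms.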


Figure 1-1. Reset Circuit

[Figure: an RC network (R1 = 100 kΩ, C1 = 4.7 µF) supplied from +5 V drives a
74ALS34 buffer, whose output is connected to the RESET input of the
TMS320C4x.]


The voltage on the RESET pin is controlled by the R1C1 network. After a reset,
this voltage rises exponentially according to the time constant R1C1, as shown
in Figure 1-2. In Figure 1-1, the 74ALS34 provides a clean RESET signal to
the ’C4x.


Figure 1-2. Voltage on the RESET Pin

[Figure: plot of voltage versus time. The curve V = VCC (1 - e^(-t/τ)) starts
at t0 = 0, rises toward VCC, and crosses the level V1 at time t1.]


The duration of the low pulse on the RESET pin is approximately t1, which is
the time it takes for the capacitor C1 to be charged to 1.5 V. This is
approximately the voltage at which the reset input switches from a logic 0 to
a logic 1. The capacitor voltage is expressed as

$$V = V_{CC}\left[1 - e^{-t/\tau}\right] \qquad (5)$$

where τ = R1C1 is the reset circuit time constant. Solving (5) for t results in

$$t = -R_1 C_1 \ln\left[1 - \frac{V}{V_{CC}}\right] \qquad (6)$$

Setting the following:

    R1 = 100 kΩ
    C1 = 4.7 µF
    VCC = 5 V
    V = V1 = 1.5 V

results in t = 167 ms. Therefore, the reset circuit of Figure 1-1 provides a low
pulse for a long enough time to ensure the stabilization of the system oscillator
upon powerup.
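
As a check on the 167 ms figure, the component values can be substituted into the expression for t step by step. The rounding of the logarithm to four digits is ours.

```latex
\begin{aligned}
t_1 &= -R_1 C_1 \ln\left(1 - \frac{V_1}{V_{CC}}\right) \\
    &= -(100 \times 10^{3}\,\Omega)(4.7 \times 10^{-6}\,\mathrm{F})
       \ln\left(1 - \frac{1.5\,\mathrm{V}}{5\,\mathrm{V}}\right) \\
    &= -(0.47\,\mathrm{s}) \ln(0.7) \\
    &\approx (0.47\,\mathrm{s})(0.3567) \approx 0.1676\,\mathrm{s}
\end{aligned}
```

The result is about 167 ms, well above the 20 ms or more that the oscillator may need to stabilize and inside the recommended 100 to 200 ms window.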


Note:

The RESET input does not have internal Schmitt-trigger hysteresis. To ensure
proper reset operation, avoid slow rise and fall times. Rise/fall time should
not exceed one CLKIN cycle.




1.3 Multiprocessing System Reset Considerations 


If synchronization of multiple ’C4x DSPs is required, all processors should be 
provided with the same input clock and the same reset signal. After powerup, 
when the clock has stabilized, set RESET high for a few H1/H3 cycles and then 
set it low to synchronize their H1/H3 clock phases. Following the falling edge, 
RESET should remain low for at least ten H1 cycles and then be driven high. 
The circuit in Figure 1-1 can be used for RESET generation. 


Pullup resistors are recommended at each end of the connection to avoid
unintended triggering after reset when RESET going low is not received on all
’C4x devices at the same time.


It is recommended that you power up the system with RESET low. This
prevents ’C4x asynchronous signals from driving unknown values
before RESET goes low, which could create bus contention in
communication-port pins, resulting in damage to the device.
1.4 How to Initialize the Processor


After reset, the ’C4x jumps to the address stored in the reset vector location and
starts execution from that point. The RESET vector normally contains the ad- 
dress of the system initialization routine. 


The initialization routine should typically perform several tasks: 


- Set the DP register.
- Set the stack pointer.
- Set the interrupt vector table.
- Set the trap vector table.
- Set the memory control register.
- Clear/enable cache.


Note:


When running under microcomputer mode (ROMEN = 1), the address
stored in the reset vector location points to the beginning of the bootloader
code. The on-chip bootloader automatically initializes the memory-control
register values from the bootloader table.

The following examples illustrate how to initialize the ’C4x when using
assembly language and when using C.


Processor initialization under assembly language 


If you are running under an assembly-only environment, Example 1-1 pro- 
vides a basic initialization routine. This example shows code for initializing the 
’C4x to the following machine state: 

- Timer 0 interrupt is enabled.
- Trap 0 is initialized.
- The program cache is enabled.
- The DP is initialized to point to the .text section.
- The stack pointer is initialized to the beginning of the mystack section.
- The memory control registers are initialized.
- The ’C4x is initialized to run in microprocessor mode with the reset vector
  located at address 08000 0000h (RESETLOC(1,0) = 1,0).
- The program has already been loaded into memory at address 0x4000 0000.


You need to allocate the section addresses using a linker command file, as
shown in Example 1-2. See the TMS320 Floating-Point DSP Assembly Language
Tools User’s Guide for more information about linker command files.
Example 1-1. Processor Initialization Example

    ;
    ; Create Reset Vector
    ;
             .sect   "rst_sect"       ; Named section for RESET vector
    reset    .word   init             ; RESET vector
    ;
    ; Create Interrupt Vector Table
    ;
    _myvect  .sect   "myvect"         ; Named section for int. vectors
             .space  2                ; Reserved space
             .word   tint0            ; Timer 0 ISR address
    ;
    ; Create Trap Vector Table
    ;
    _mytrap  .sect   "mytrap"         ; Named section for trap vectors
             .word   trap0            ; Trap 0 subroutine address
    ;
    ; Create Stack
    ;
    _mystack .usect  "mystack",500    ; Reserve 500 locations for stack
    ;
             .text
    stacka   .word   _mystack         ; Address of mystack section
    ivta     .word   _myvect          ; Address of myvect section
    tvta     .word   _mytrap          ; Address of mytrap section
    ieval    .word   1                ; IIE register value
    gctrl    .word   ????????         ; Target board specific
    lctrl    .word   ????????         ; Target board specific
    mctrla   .word   100000h          ; Address of the global memory
                                      ; control register
    init:
    ;
    ; Initialize the DP Register
    ;
             LDP     stacka
    ;
    ; Set Expansion Register IVTP
    ;
             LDI     @ivta,AR0
             LDPE    AR0,IVTP
    ;
    ; Set Expansion Register TVTP
    ;
             LDI     @tvta,AR0
             LDPE    AR0,TVTP
    ;
    ; Initialize global memory interface control
    ;
             LDI     @mctrla,AR0
             LDI     @gctrl,R0
             STI     R0,*AR0
    ;
    ; Initialize local memory interface control
    ;
             LDI     @lctrl,R0
             STI     R0,*+AR0(4)
    ;
    ; Initialize the Stack Pointer
    ;
             LDI     @stacka,SP
    ;
    ; Enable timer interrupt
    ; This is equivalent to LDI 1,IIE
    ;
             LDI     @ieval,IIE
    ;
    ; Clear/Enable Cache and Enable Global Interrupts
    ;
             OR      3800h,ST         ; Global interrupt enable
             BR      begin            ; Branch to the beginning of
                                      ; the application
    begin
             < this is your application code >
    trap0
             < this is your trap0 trap code >
             RETI
    tint0
             < this is your tint0 interrupt service routine >
             RETI
             .end
Example 1-2. Linker Command File for Linking the Previous Example

    MEMORY
    {
        EPROM: org = 0x80000000  len = 0x10   /* EPROM reset vector location */
        RAM:   org = 0x40000000  len = 0x100  /* external RAM */
    }

    /* SPECIFY THE SECTIONS ALLOCATION INTO MEMORY */

    SECTIONS
    {
        rst_sect: > EPROM
        myvect:   > RAM
        mystack:  > RAM
        .text:    > RAM
        mytrap:   > RAM
    }


Processor initialization under C language 


If you are running under a C environment, your initialization routine is typically
boot.asm (from the RTS40.LIB library that comes with the floating-point
compiler). In addition to initializing global variables, boot.asm initializes the DP
register (pointing to the .bss section) and the SP register (pointing to the .stack
section). You need to enable the cache, as shown in Example 1-3, and set up
your interrupts inside your main routine before you enable interrupts. See the
application report Setting Up TMS320 DSP Interrupts in C (SPRA036) for
more information.


Example 1-3. Enabling the Cache

main()
{
    asm("   or   1800h,st");         /* enable cache */
    /* asm("   or   3800h,st"); */   /* enable cache and interrupts */
}
Program Control

Several ’C4x instructions provide program control and facilitate high-speed
processing. These instructions directly handle:

- Regular and zero-overhead subroutine calls
- Software stack
- Interrupts
- Delayed branches
- Single- and multiple-instruction loops without overhead


2.1. Subroutines 


The ’C4x provides two ways to invoke subroutine calls: regular calls and zero- 
overhead calls. The regular and zero-overhead subroutine calls use the soft- 
ware stack and extended-precision register R11, respectively, to save the re- 
turn address. The following subsections use example programs to explain how 
this works. 


2.1.1. Regular Subroutine Calls 
The ’C4x has a 32-bit program counter (PC) and a virtually unlimited software
stack. The CALL and CALLcond subroutine calls increment the stack pointer
and store the next value of the PC (the return address) on the stack. At the
end of the subroutine, RETScond performs a conditional return.

Example 2-1 illustrates the use of a subroutine to determine the dot product
of two vectors. Given two vectors of length N, represented by the arrays a[0],
a[1], ..., a[N-1] and b[0], b[1], ..., b[N-1], the dot product is computed from the
expression

d = a[0] b[0] + a[1] b[1] + ... + a[N-1] b[N-1]


Processing proceeds in the main routine to the point where the dot product is 
to be computed. It is assumed that the arguments of the subroutine have been 
appropriately initialized. At this point, a CALL is made to the subroutine, trans- 
ferring control to that section of the program memory for execution, then re- 
turning to the calling routine via the RETS instruction when execution has com- 
pleted. Note that for this particular example, it would suffice to save the register 
R2. However, a larger number of registers are saved for demonstration pur- 
poses. The saved registers are stored on the system stack, which should be 
large enough to accommodate the maximum anticipated storage require- 
ments. Other methods of saving registers could be used equally well. 


Example 2-1. Regular Subroutine Call (Dot Product)

*
*  TITLE REGULAR SUBROUTINE CALL (DOT PRODUCT)
*
*
*  MAIN ROUTINE THAT CALLS THE SUBROUTINE 'DOT' TO COMPUTE THE
*  DOT PRODUCT OF TWO VECTORS.
*
        LDI     @blk0,AR0       ;AR0 points to vector a
        LDI     @blk1,AR1       ;AR1 points to vector b
        LDI     N,RC            ;RC contains the number of elements
        CALL    DOT
*
*  SUBROUTINE DOT
*
*  EQUATION: d = a(0) * b(0) + a(1) * b(1) + ... + a(N-1) * b(N-1)
*
*  THE DOT PRODUCT OF a AND b IS PLACED IN REGISTER R0. N MUST
*  BE GREATER THAN OR EQUAL TO 2.
*
*  ARGUMENT ASSIGNMENTS:
*       ARGUMENT  |  FUNCTION
*       ----------+------------------------
*       AR0       |  ADDRESS OF a(0)
*       AR1       |  ADDRESS OF b(0)
*       RC        |  LENGTH OF VECTORS (N)
*
*  REGISTERS USED AS INPUT: AR0, AR1, RC
*  REGISTER MODIFIED: R0
*  REGISTER CONTAINING RESULT: R0
*
        .global DOT
*
DOT     PUSH    ST              ;Save status register
        PUSH    R2              ;Use the stack to save R2's
        PUSHF   R2              ;bottom 32 and top 32 bits
        PUSH    AR0             ;Save AR0
        PUSH    AR1             ;Save AR1
        PUSH    RC              ;Save RC
        PUSH    RS
        PUSH    RE
*
*  Initialize R0:
*
        MPYF3   *AR0,*AR1,R0    ;a(0) * b(0) -> R0
||      SUBF    R2,R2,R2        ;Initialize R2.
        SUBI    2,RC            ;Set RC = N-2
*
*
*  DOT PRODUCT (1 <= i < N)
*
        RPTS    RC              ;Setup the repeat single.
        MPYF3   *++AR0(1),*++AR1(1),R0  ;a(i) * b(i) -> R0
||      ADDF3   R0,R2,R2        ;a(i-1)*b(i-1) + R2 -> R2
*
        ADDF3   R0,R2,R0        ;a(N-1)*b(N-1) + R2 -> R0
*
*
*  RETURN SEQUENCE
*
        POP     RE
        POP     RS
        POP     RC              ;Restore RC
        POP     AR1             ;Restore AR1
        POP     AR0             ;Restore AR0
        POPF    R2              ;Restore top 32 bits of R2
        POP     R2              ;Restore bottom 32 bits of R2
        POP     ST              ;Restore ST
        RETS                    ;Return
*
*  end
*
        .end


2.1.2 Zero-Overhead Subroutine Calls 


Two instructions, link and jump (LAJ) and link and jump conditional (LAJcond),
implement zero-overhead subroutine calls on the ’C4x. Unlike CALL and
CALLcond, which put the value of PC + 1 onto the software stack,
LAJ and LAJcond put the value of PC + 4 into extended-precision register R11.
The three instructions following LAJ or LAJcond are executed before control
passes to the subroutine. These three instructions are subject to the same
restrictions as the three instructions following a delayed branch. At the end of the
subroutine, you can use a delayed conditional branch, BcondD, in the register
addressing mode with R11 as source, to perform a zero-overhead subroutine
return.


For comparison, the same dot product example with a zero-overhead subrou- 
tine call is given in the following example program. 
Example 2-2. Zero-Overhead Subroutine Call (Dot Product)

*
*  TITLE ZERO-OVERHEAD SUBROUTINE CALL (DOT PRODUCT)
*
*
*  MAIN ROUTINE THAT CALLS THE SUBROUTINE 'DOT' TO COMPUTE THE
*  DOT PRODUCT OF TWO VECTORS.
*
        LAJ     DOT
        LDI     @blk0,AR0       ;AR0 points to vector a
        LDI     @blk1,AR1       ;AR1 points to vector b
        LDI     N,RC            ;RC contains the number of elements
*
*  SUBROUTINE DOT
*
*  EQUATION: d = a(0) * b(0) + a(1) * b(1) + ... + a(N-1) * b(N-1)
*
*  THE DOT PRODUCT OF a AND b IS PLACED IN REGISTER R0. N MUST
*  BE GREATER THAN OR EQUAL TO 2.
*
*  ARGUMENT ASSIGNMENTS:
*       ARGUMENT  |  FUNCTION
*       ----------+------------------------
*       AR0       |  ADDRESS OF a(0)
*       AR1       |  ADDRESS OF b(0)
*       RC        |  LENGTH OF VECTORS (N)
*
*  REGISTERS USED AS INPUT: AR0, AR1, RC
*  REGISTER MODIFIED: R0
*  REGISTER CONTAINING RESULT: R0
*
        .global DOT
*
DOT     PUSH    ST              ;Save status register
        PUSH    R2              ;Use the stack to save R2's
        PUSHF   R2              ;bottom 32 and top 32 bits
        PUSH    AR0             ;Save AR0
        PUSH    AR1             ;Save AR1
        PUSH    RC              ;Save RC
        PUSH    RS
        PUSH    RE
*
*  Initialize R0:
*
        MPYF3   *AR0,*AR1,R0    ;a(0) * b(0) -> R0
||      SUBF    R2,R2,R2        ;Initialize R2.
        SUBI    2,RC            ;Set RC = N-2
*
*  DOT PRODUCT (1 <= i < N)
*
        RPTS    RC              ;Setup the repeat single
        MPYF3   *++AR0(1),*++AR1(1),R0  ;a(i) * b(i) -> R0
||      ADDF3   R0,R2,R2        ;a(i-1)*b(i-1) + R2 -> R2
*
        ADDF3   R0,R2,R0        ;a(N-1)*b(N-1) + R2 -> R0
*
*  RETURN SEQUENCE
*
        POP     RE
        POP     RS
        POP     RC              ;Restore RC
        POP     AR1             ;Restore AR1
        POP     AR0             ;Restore AR0
        BUD     R11             ;Return
        POPF    R2              ;Restore top 32 bits of R2
        POP     R2              ;Restore bottom 32 bits of R2
        POP     ST              ;Restore ST
*
*  end
*
        .end
2.2 Stacks and Queues

The ’C4x provides a dedicated stack pointer (SP) for building stacks in
memory. Also, the auxiliary registers can be used to build user stacks and a
variety of more general linear lists. This section discusses the implementation
of the following types of linear lists:

Stack     A linear list for which all insertions and deletions are made at one
          end of the list.

Queue     A linear list for which all insertions are made at one end of the
          list, and all deletions are made at the other end.

Dequeue   A double-ended queue: a linear list for which insertions and
          deletions are made at either end of the list.


2.2.1 System Stacks

A stack in the ’C4x fills from a low-memory address to a high-memory address,
as shown in Figure 2-1. A system stack stores addresses and data during
subroutine calls, traps, and interrupts.


The stack pointer (SP) is a 32-bit register that contains the address of the top 
of the system stack. The SP always points to the last element pushed onto the 
stack. A push performs a preincrement, and a pop performs a postdecrement 
of the SP. Provisions should be made to accommodate your software’s antici- 
pated storage requirements. 
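
The push and pop behavior described above can be sketched in C. This is only a model under stated assumptions: the SystemStack type, the STACK_WORDS size, and the function names are hypothetical, an array index stands in for the SP register, and no overflow or underflow checking is done.

```c
#include <stdint.h>

#define STACK_WORDS 16   /* hypothetical stack size, for illustration only */

typedef struct {
    uint32_t memory[STACK_WORDS];  /* stands in for the stack area in RAM */
    int sp;                        /* index of the last element pushed */
} SystemStack;

/* An empty stack: sp sits one word below the first usable location. */
void stack_init(SystemStack *s)
{
    s->sp = -1;
}

/* Push: preincrement sp, then store (the stack grows toward high memory). */
void stack_push(SystemStack *s, uint32_t value)
{
    s->sp += 1;
    s->memory[s->sp] = value;
}

/* Pop: read the element sp points to, then postdecrement sp. */
uint32_t stack_pop(SystemStack *s)
{
    uint32_t value = s->memory[s->sp];
    s->sp -= 1;
    return value;
}
```

After two pushes, sp indexes the second value; two pops return the values in reverse order and leave sp where it started.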


The SP can be read from as well as written to; multiple stacks
can be created by updating the SP. The SP is not initialized by the hardware
during reset, so you must initialize it to point
to a predetermined memory location. Example 1-1 shows how
to initialize the SP. You must initialize the stack to a valid free memory space.
Otherwise, use of the stack could corrupt data or program memory.


The program counter is pushed onto the system stack on subroutine calls,
traps, and interrupts. It is popped from the system stack on returns. The PUSH,
POP, PUSHF, and POPF instructions push and pop the system stack. The
stack can be used inside subroutines for temporary storage of registers,
as shown in Example 2-1.
Two instructions, PUSHF and POPF, are for floating-point numbers. These
instructions push floating-point numbers from, and pop them to, registers R0-R11. This
feature is very useful for saving the extended-precision registers (see
Example 2-1 and Example 2-2). PUSH saves the lower 32 bits of an
extended-precision register, and PUSHF saves the upper 32 bits. To recover the
extended-precision number, execute a POPF followed by a POP. It is important to
perform the integer and floating-point PUSH and POP in this order, because
POPF forces the last eight bits of the extended-precision register to zero.
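
The reason the restore order matters can be illustrated with a small C model. The sketch assumes a 40-bit extended-precision register held in a uint64_t, with PUSHF/POPF acting on bits 39-8 and PUSH/POP acting on bits 31-0 (POP is assumed to leave bits 39-32 unchanged). All function names are hypothetical.

```c
#include <stdint.h>

#define MASK40 0xFFFFFFFFFFULL   /* a 40-bit register value */

/* PUSH saves the lower 32 bits (bits 31-0). */
uint32_t push_low(uint64_t reg)
{
    return (uint32_t)(reg & 0xFFFFFFFFULL);
}

/* PUSHF saves the upper 32 bits (bits 39-8). */
uint32_t pushf_high(uint64_t reg)
{
    return (uint32_t)((reg >> 8) & 0xFFFFFFFFULL);
}

/* POPF loads bits 39-8 and forces bits 7-0 to zero. */
uint64_t popf_high(uint32_t saved)
{
    return ((uint64_t)saved << 8) & MASK40;
}

/* POP loads bits 31-0 and keeps bits 39-32 of the current register. */
uint64_t pop_low(uint64_t reg, uint32_t saved)
{
    return (reg & 0xFF00000000ULL) | (uint64_t)saved;
}

/* Correct order: POPF first, then POP fills in the low eight bits. */
uint64_t restore_correct_order(uint32_t lo, uint32_t hi)
{
    return pop_low(popf_high(hi), lo);
}

/* Wrong order: POP first, then POPF zeroes the low eight bits again. */
uint64_t restore_wrong_order(uint32_t lo, uint32_t hi)
{
    uint64_t reg = pop_low(0, lo);
    reg = popf_high(hi);
    return reg;
}
```

In this model, only the POPF-then-POP sequence reproduces the original 40-bit value; the reversed sequence returns the value with its eight least significant bits cleared.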


Figure 2-1. System Stack Configuration

        Low Memory
        Bottom of stack
            .
            .
SP ->   Top of stack
        (Free)
        High Memory


2.2.2 User Stacks

User stacks can be built to store data from low-to-high memory or from high-to- 
low memory. Two cases for each type of stack are shown. You can build stacks 
by using the preincrement/decrement and postincrement/decrement modes 
of modifying the auxiliary registers (AR). 


You can implement stack growth from high to low memory in two ways: 


Case 1: Store to memory using *--ARn to push data onto the stack, and read
from memory using *ARn++ to pop data off the stack.


Case 2: Store to memory using *ARn-- to push data onto the stack, and read
from memory using *++ARn to pop data off the stack.


Figure 2-2 illustrates these two cases. The only difference is that in case 1,
the AR always points to the top of the stack, and in case 2, the AR always points
to the next free location on the stack.
Figure 2-2. Implementations of High-to-Low Memory Stacks

          Case 1                        Case 2

        Low Memory                    Low Memory
        (Free)                ARn ->  (Free)
ARn ->  Top of stack                  Top of stack
        Bottom of stack               Bottom of stack
        High Memory                   High Memory


You can implement stack growth from low to high memory in two ways: 


Case 3: Store to memory using *++ARn to push data onto the stack, and read
from memory using *ARn-- to pop data off the stack.


Case 4: Store to memory using *ARn++ to push data onto the stack, and read
from memory using *--ARn to pop data off the stack.


Figure 2—3 shows these two cases. In case 3, the AR always points to the top 
of the stack. In case 4, the AR always points to the next free location on the 
stack. 


Figure 2-3. Implementations of Low-to-High Memory Stacks

          Case 3                        Case 4

        Low Memory                    Low Memory
        Bottom of stack               Bottom of stack
ARn ->  Top of stack                  Top of stack
        (Free)                ARn ->  (Free)
        High Memory                   High Memory
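
The four user-stack cases map directly onto C's prefix and postfix increment and decrement operators. The following sketch is a model only: the UserStack type and the function names are hypothetical, and the ar field is a word index that plays the role of ARn. No bounds checking is done.

```c
#include <stdint.h>

typedef struct {
    uint32_t *mem;   /* memory region that holds the stack */
    int ar;          /* word index standing in for ARn */
} UserStack;

/* Case 1: high-to-low, ar points to the top of the stack. */
void     push_case1(UserStack *s, uint32_t v) { s->mem[--s->ar] = v; }   /* *--ARn */
uint32_t pop_case1(UserStack *s)              { return s->mem[s->ar++]; } /* *ARn++ */

/* Case 2: high-to-low, ar points to the next free location. */
void     push_case2(UserStack *s, uint32_t v) { s->mem[s->ar--] = v; }   /* *ARn-- */
uint32_t pop_case2(UserStack *s)              { return s->mem[++s->ar]; } /* *++ARn */

/* Case 3: low-to-high, ar points to the top of the stack. */
void     push_case3(UserStack *s, uint32_t v) { s->mem[++s->ar] = v; }   /* *++ARn */
uint32_t pop_case3(UserStack *s)              { return s->mem[s->ar--]; } /* *ARn-- */

/* Case 4: low-to-high, ar points to the next free location. */
void     push_case4(UserStack *s, uint32_t v) { s->mem[s->ar++] = v; }   /* *ARn++ */
uint32_t pop_case4(UserStack *s)              { return s->mem[--s->ar]; } /* *--ARn */
```

The starting value of ar differs by one word between the "top of stack" and "next free" conventions: for an 8-word region, cases 1 through 4 would start at index 8, 7, -1, and 0, respectively.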


2.2.3 Queues and Double-Ended Queues 


The implementation of queues and double-ended queues is based on the
manipulation of the auxiliary registers for user stacks.


For queues, two auxiliary registers are used: one to mark the front of the queue,
from which data is popped, and the other to mark the rear of the queue, to which
data is pushed.


For double-ended queues, two auxiliary registers are also necessary. One 
register marks one end of the double-ended queue, and the other register 
marks the other end. Data can be popped from or pushed onto either end. 
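
A minimal C sketch of this two-register scheme follows. It assumes a simple linear (non-circular) buffer with no bounds checking; the Queue type and function names are hypothetical, and the front and rear indices stand in for the two auxiliary registers.

```c
#include <stdint.h>

typedef struct {
    uint32_t *mem;   /* memory region that holds the list */
    int front;       /* index of the next element to remove */
    int rear;        /* index of the next free slot at the rear */
} Queue;

/* Queue operations: insert at the rear, remove from the front. */
void     enqueue(Queue *q, uint32_t v) { q->mem[q->rear++] = v; }
uint32_t dequeue(Queue *q)             { return q->mem[q->front++]; }

/* Additional operations that turn the queue into a double-ended queue. */
void     push_front(Queue *q, uint32_t v) { q->mem[--q->front] = v; }
uint32_t pop_rear(Queue *q)               { return q->mem[--q->rear]; }

/* The list is empty when both indices meet. */
int queue_is_empty(const Queue *q) { return q->front == q->rear; }
```

Starting both indices in the middle of the region leaves room for a double-ended queue to grow in either direction.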


2.3 Interrupt Examples

When using interrupts, you must consider several issues. This section offers
examples of several interrupt-related topics:

- Interrupt Service Routines
- Context Switching
- Interrupt-Vector Table (IVTP)
- Interrupt Priorities


2.3.1. Correct Interrupt Programming 


For interrupts to work properly, you need to execute the following sequence of
steps, as shown in Example 1-1:

1) Place the interrupt-vector table on a 512-word boundary.
2) Initialize the IVTP register.
3) Create a software stack.
4) Enable the specific interrupt.
5) Enable global interrupts.
6) Generate the interrupt signal.


2.3.2 Software Polling of Interrupts 


The interrupt flag register can be polled, and action can be taken, depending 
on whether an interrupt has occurred. This is true even when maskable inter- 
rupts are disabled. This can be useful when an interrupt-driven interface is not 
implemented. Example 2—3 shows the case in which a subroutine is called 
when external interrupt 1 has not occurred. 


Example 2-3. Use of Interrupts for Software Polling 


* TITLE INTERRUPT POLLING 


TSTB 40H, IIF ;Test if interrupt 1 has occurred 
CALLZ SUBROUTINE ;If not, call subroutine 
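
The test-and-call pattern of Example 2-3 can be sketched in C as follows. The function name should_call is hypothetical, and the flag register contents are passed in as a plain integer for illustration; on the device the flags are read from the IIF register.

```c
#include <stdint.h>

/* Mirrors TSTB mask,IIF followed by CALLZ: the AND result is used only
   as a condition and is not stored. Returns 1 when none of the masked
   flag bits are set, that is, when the subroutine would be called. */
int should_call(uint32_t iif, uint32_t mask)
{
    return (iif & mask) == 0u;
}
```

With the mask 40H used in the example, should_call returns 1 while bit 6 of the flag value is clear and 0 once it is set, regardless of the other flag bits.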


When interrupt processing begins, the program counter is pushed onto the
stack, and the interrupt vector is loaded into the program counter. Interrupts are
disabled (GIE is cleared to 0), and the program continues from the address
loaded in the program counter. Because all maskable interrupts are disabled,
interrupt processing can proceed without further interruption unless the
interrupt service routine re-enables interrupts or an NMI occurs.


2.3.3 Using One Interrupt for Two Services


The IVTP can be changed to point to alternate interrupt-vector tables. This re- 
locatable feature of the table allows you to use a single interrupt signal for more 
than one service. 


In Example 2-4, the IVTP is set to a new value in the external INT0 interrupt
service routines EINT0A and EINT0B. After the value of the IVTP is changed, the CPU
goes to a different interrupt service routine when the same interrupt signal
recurs.


Example 2-4. Use of One Interrupt Signal for Two Different Services

*  TITLE USE OF ONE INTERRUPT SIGNAL FOR TWO DIFFERENT SERVICES
*
*  IN THIS EXAMPLE, THE ADDRESS OF EINT0A AND EINT0B ARE IN
*  MEMORY LOCATION 03H AND 1003H, RESPECTIVELY. ASSUME THE IVTP
*  HAS NOT BEEN CHANGED AFTER DEVICE RESET AND THE EXTERNAL
*  INTERRUPT IIOF0 IS ENABLED. WHEN THE FIRST IIOF0 INTERRUPT
*  SIGNAL COMES IN, THE EINT0A ROUTINE WILL BE EXECUTED. WHEN
*  THE NEXT IIOF0 INTERRUPT SIGNAL OCCURS, THE EINT0B ROUTINE
*  WILL BE EXECUTED, AND SO ON. THE EINT0A AND EINT0B ROUTINES
*  WILL TAKE TURNS TO BE EXECUTED WHEN THE IIOF0 INTERRUPT
*  SIGNAL OCCURS.
*
*  External IIOF0 interrupt service routine A
*
        .global EINT0A
EINT0A:
        LDI     1000H,R0        ;Change IVTP to point to 1000H
        LDPE    R0,IVTP

        RETI                    ;Return and enable interrupts
*
*  External IIOF0 interrupt service routine B
*
        .global EINT0B
EINT0B:
        LDI     0,R0            ;Change IVTP to point to 0
        LDPE    R0,IVTP

        RETI                    ;Return and enable interrupts



2.3.4 Nesting Interrupts 


In Example 2-5, the interrupt service routine for INT2 temporarily modifies the
interrupt enable register (IIE) and interrupt flag register (IIF) to permit interrupt
processing when an interrupt to INT0 or NMI (but no other interrupt) occurs.
When the routine finishes processing, the IIE register is restored to its original
state. Notice that the RETIcond instruction not only pops the next program
counter address from the stack, but also restores the GIE and CF bits from the
PGIE and PCF bits. This re-enables all interrupts that were enabled before the
INT2 interrupt was serviced.


Example 2-5. Interrupt Service Routine 


*  TITLE INTERRUPT SERVICE ROUTINE
*
        .global ISR2
*
ENABLE  .set    2000h
MASK    .set    9h
*
*  INTERRUPT PROCESSING FOR EXTERNAL INTERRUPT INT2-
*
ISR2:
        PUSH    ST              ;Save status register
        PUSH    DP              ;Save data page pointer
        PUSH    IIE             ;Save interrupt enable register
        PUSH    IIF
        PUSH    R0              ;Save lower 32 bits and
        PUSHF   R0              ;upper 32 bits of R0
        PUSH    R1              ;Save lower 32 bits and
        PUSHF   R1              ;upper 32 bits of R1
        LDI     0,IIE           ;Unmask all internal interrupts
        LDI     MASK,R0
        MH0     R0,IIF          ;Enable INT2
        OR      ENABLE,ST       ;Enable all interrupts
*
*  MAIN PROCESSING SECTION FOR ISR2
*
        XOR     ENABLE,ST       ;Disable all interrupts
        POPF    R1              ;Restore upper 32 bits and
        POP     R1              ;lower 32 bits of R1
        POPF    R0              ;Restore upper 32 bits and
        POP     R0              ;lower 32 bits of R0
        POP     IIF
        POP     IIE             ;Restore interrupt enable register
        POP     DP              ;Restore data page register
        POP     ST              ;Restore status register
*
        RETI                    ;Return and enable interrupts
2.4 Context Switching in Interrupts and Subroutines


Context switching is commonly required when a subroutine call or interrupt is 
processed. It can be extensive or simple, depending on system requirements. 
For the ’C4x, the program counter is automatically pushed onto the stack. Im- 
portant information in other ’C4x registers, such as the status, auxiliary, or ex- 
tended-precision registers, must be saved in the stack with PUSH/PUSHF and 
recovered later with POP/POPF instructions. 


You need to preserve only the registers that are modified inside of your subrou- 
tine or interrupt/trap service routine and that could potentially affect the pre- 
vious context environment. 


Note:

The status register should be saved first and restored last to preserve the
processor status without any further change caused by other
context-switching instructions.


If the previous context environment was in C, then your program must perform
one of two tasks:

- If the program is in a subroutine, it must preserve the dedicated C
  registers:

      Save as integers               Save as floating-point
      R4     R5                      R6     R7
      AR4    AR5
      AR6    AR7
      FP     DP (small model only)
      SP     R8 (’C4x only)

- If the program is in an interrupt service routine, it must preserve all of the
  ’C4x registers, as Example 2-6 shows.


If the previous context environment was in assembly language, you need to 
determine which registers you must save based on the operations of your as- 
sembly-language code. 


Example 2-6. Context Save and Context Restore

        .global ISR1
*
*  TOTAL CONTEXT SAVE ON INTERRUPT.
*
ISR1:   PUSH    ST              ;Save status register
*
*  SAVE THE EXTENDED PRECISION REGISTERS
*
        PUSH    R0              ;Save the lower 32 bits of R0
        PUSHF   R0              ;and the upper 32 bits
        PUSH    R1              ;Save the lower 32 bits of R1
        PUSHF   R1              ;and the upper 32 bits
        PUSH    R2              ;Save the lower 32 bits of R2
        PUSHF   R2              ;and the upper 32 bits
        PUSH    R3              ;Save the lower 32 bits of R3
        PUSHF   R3              ;and the upper 32 bits
        PUSH    R4              ;Save the lower 32 bits of R4
        PUSHF   R4              ;and the upper 32 bits
        PUSH    R5              ;Save the lower 32 bits of R5
        PUSHF   R5              ;and the upper 32 bits
        PUSH    R6              ;Save the lower 32 bits of R6
        PUSHF   R6              ;and the upper 32 bits
        PUSH    R7              ;Save the lower 32 bits of R7
        PUSHF   R7              ;and the upper 32 bits
        PUSH    R8              ;Save the lower 32 bits of R8
        PUSHF   R8              ;and the upper 32 bits
        PUSH    R9              ;Save the lower 32 bits of R9
        PUSHF   R9              ;and the upper 32 bits
        PUSH    R10             ;Save the lower 32 bits of R10
        PUSHF   R10             ;and the upper 32 bits
        PUSH    R11             ;Save the lower 32 bits of R11
        PUSHF   R11             ;and the upper 32 bits
*
*  SAVE THE AUXILIARY REGISTERS
*
        PUSH    AR0             ;Save AR0
        PUSH    AR1             ;Save AR1
        PUSH    AR2             ;Save AR2
        PUSH    AR3             ;Save AR3
        PUSH    AR4             ;Save AR4
        PUSH    AR5             ;Save AR5
        PUSH    AR6             ;Save AR6
        PUSH    AR7             ;Save AR7
*
*  SAVE THE REST OF THE REGISTERS FROM THE REGISTER FILE
*
        PUSH    DP              ;Save data page pointer
        PUSH    IR0             ;Save index register IR0
        PUSH    IR1             ;Save index register IR1
        PUSH    BK              ;Save block-size register
        PUSH    IIE             ;Save interrupt enable register
        PUSH    IIF             ;Save interrupt flag register
        PUSH    DIE             ;Save DMA interrupt enable register
        PUSH    RS              ;Save repeat start address
        PUSH    RE              ;Save repeat end address
        PUSH    RC              ;Save repeat counter
*
*  SAVE IS COMPLETE
*
*
*  YOUR INTERRUPT SERVICE ROUTINE CODE GOES HERE
*
        .global RESTR
*
*  CONTEXT RESTORE AT THE END OF A SUBROUTINE CALL OR
*  INTERRUPT.
*
RESTR:
*
*  RESTORE THE REST OF THE REGISTERS FROM THE REGISTER FILE
*
        POP     RC              ;Restore repeat counter
        POP     RE              ;Restore repeat end address
        POP     RS              ;Restore repeat start address
        POP     DIE             ;Restore DMA interrupt enable register
        POP     IIF             ;Restore interrupt flag register
        POP     IIE             ;Restore interrupt enable register
        POP     BK              ;Restore block-size register
        POP     IR1             ;Restore index register IR1
        POP     IR0             ;Restore index register IR0
        POP     DP              ;Restore data page pointer
*
*  RESTORE THE AUXILIARY REGISTERS
*
        POP     AR7             ;Restore AR7
        POP     AR6             ;Restore AR6
        POP     AR5             ;Restore AR5
        POP     AR4             ;Restore AR4
        POP     AR3             ;Restore AR3
        POP     AR2             ;Restore AR2
        POP     AR1             ;Restore AR1
        POP     AR0             ;Restore AR0
*
*  RESTORE THE EXTENDED PRECISION REGISTERS
*
        POPF    R11             ;Restore the upper 32 bits and
        POP     R11             ;the lower 32 bits of R11
        POPF    R10             ;Restore the upper 32 bits and
        POP     R10             ;the lower 32 bits of R10
        POPF    R9              ;Restore the upper 32 bits and
        POP     R9              ;the lower 32 bits of R9
        POPF    R8              ;Restore the upper 32 bits and
        POP     R8              ;the lower 32 bits of R8
        POPF    R7              ;Restore the upper 32 bits and
        POP     R7              ;the lower 32 bits of R7
        POPF    R6              ;Restore the upper 32 bits and
        POP     R6              ;the lower 32 bits of R6
        POPF    R5              ;Restore the upper 32 bits and
        POP     R5              ;the lower 32 bits of R5
        POPF    R4              ;Restore the upper 32 bits and
        POP     R4              ;the lower 32 bits of R4
        POPF    R3              ;Restore the upper 32 bits and
        POP     R3              ;the lower 32 bits of R3
        POPF    R2              ;Restore the upper 32 bits and
        POP     R2              ;the lower 32 bits of R2
        POPF    R1              ;Restore the upper 32 bits and
        POP     R1              ;the lower 32 bits of R1
        POPF    R0              ;Restore the upper 32 bits and
        POP     R0              ;the lower 32 bits of R0
        POP     ST              ;Restore status register
*
*  RESTORE IS COMPLETE
*
        RETI
2.5 Repeat Modes


2.5.1 Block Repeat


The RPTB, RPTBD, and RPTS instructions support looping without overhead.
Loop execution parameters are specified by three registers, as can be seen
in the following examples:

- RS (repeat start address)
- RE (repeat end address)
- RC (repeat counter)


In principle, it is possible to nest repeat blocks. However, there is only one set
of control registers: RS, RE, and RC. It is, therefore, necessary to save these
registers before entering an inner loop and to restore them after
completing the inner loop. It takes four cycles of overhead to save and restore
these registers. Hence, it may sometimes be more economical to implement
a nested loop by the more traditional method of using a register as a counter
and then using a delayed branch, rather than by using the nested repeat block
approach. Often, implementing the outer loop as a counter and the inner loop
as an RPTB/RPTBD instruction produces the fastest execution.


Example 2—7 shows the use of the block repeat to find the maximum or the 
minimum value of 147 numbers. The elements of the array are either all 
positive or all negative numbers. Because the loop cannot be predetermined, 
the RPTBD instruction is not suitable here. 
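
The selection logic can be sketched in C. Which of the two loops runs is decided at run time from the sign of the first element, which is why the loop cannot be fixed in advance. The function name max_or_min is hypothetical, and the sketch assumes n >= 1 and an array whose elements are all positive or all negative.

```c
#include <stddef.h>

/* Returns the maximum of an all-positive array or the minimum of an
   all-negative array. The first element initializes the result and
   selects which search loop is used. */
float max_or_min(const float *x, size_t n)
{
    float result = x[0];
    size_t i;

    if (result < 0.0f) {
        /* Negative array: search for the minimum. */
        for (i = 1; i < n; i++) {
            if (x[i] < result) {
                result = x[i];
            }
        }
    } else {
        /* Positive array: search for the maximum. */
        for (i = 1; i < n; i++) {
            if (x[i] > result) {
                result = x[i];
            }
        }
    }
    return result;
}
```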


Example 2-7. Use of Block Repeat to Find a Maximum or a Minimum

*  TITLE USE OF BLOCK REPEAT TO FIND A MAXIMUM OR A MINIMUM
*
*  THIS ROUTINE FINDS MAXIMUM OR MINIMUM OF N=147 NUMBERS
*
        LDI     146,RC          ;Initialize repeat counter to 147-1
        LDI     @ADDR,AR0       ;AR0 points to beginning of array
        LDF     *AR0++(1),R0    ;Initialize MAX or MIN to first value
        BLT     LOOP2           ;If negative array, find minimum
*
LOOP1   RPTB    MAX
        CMPF    *AR0,R0         ;Compare number to the maximum
MAX     LDFLT   *AR0++(1),R0    ;If greater, this is a new maximum
        B       NEXT
*
LOOP2   RPTB    MIN
        CMPF    *AR0++(1),R0    ;Compare number to the minimum
MIN     LDFGT   *-AR0(1),R0     ;If smaller, this is new minimum
Example 2-8 shows an application of the delayed block-repeat construct. In
this example, an array of 64 elements is flipped over by exchanging the
elements that are equidistant from the ends of the array. In other words, if the
original array is:


a(1), a(2),..., a(31), a(32),..., a(63), a(64);


then the final array after the rearrangement is:


a(64), a(63),..., a(32), a(31),..., a(2), a(1).


Because the exchange operation is performed on two elements at the same
time, it requires 32 operations. The repeat counter (RC) is initialized to 31. In
general, if RC contains the number N, the loop is executed N + 1 times. In the
example, the loop begins at the fourth instruction following the RPTBD
instruction (at the START label) and ends at the EXCH label. RC should not be
initialized in the three instructions following the RPTBD.
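
The exchange can be sketched in C with two pointers that walk toward each other, one from each end of the array. The function name flip_array is hypothetical; for n = 64 the loop body runs 32 times, matching the RC + 1 iterations described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverses an array in place by exchanging elements that are
   equidistant from its two ends. */
void flip_array(int32_t *a, size_t n)
{
    int32_t *lo;
    int32_t *hi;
    size_t count;

    if (n < 2) {
        return;               /* nothing to exchange */
    }
    lo = a;                   /* points to the beginning of the array */
    hi = a + (n - 1);         /* points to the end of the array */
    for (count = n / 2; count > 0; count--) {
        int32_t r0 = *lo;     /* load one element ... */
        int32_t r1 = *hi;     /* ... and the other */
        *lo++ = r1;           /* then exchange their locations */
        *hi-- = r0;
    }
}
```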


Example 2-8. Loop Using Delayed Block Repeat

*  TITLE LOOP USING DELAYED BLOCK REPEAT
*
*  THIS CODE SEGMENT EXCHANGES THE VALUES OF ARRAY
*  ELEMENTS THAT ARE SYMMETRIC AROUND THE MIDDLE OF THE
*  ARRAY.
*
        LDI     31,RC           ;Initialize repeat counter
*
        RPTBD   EXCH            ;Repeat RC + 1 times between
                                ;START and EXCH
        LDI     @ADDR,AR0       ;AR0 points to the beginning of the array
        LDI     AR0,AR1
        ADDI    63,AR1          ;AR1 points to the end of the array
*
*  The loop starts here
*
START   LDI     *AR0,R0         ;Load one memory element in R0,
||      LDI     *AR1,R1         ;and the other in R1
EXCH    STI     R1,*AR0++(1)    ;Then, exchange their locations
||      STI     R0,*AR1--(1)
2.5.3 Single-Instruction Repeat


Example 2-9 shows an application of the repeat-single construct. In this
example, the sum of the products of two arrays is computed. The arrays are not
necessarily different. If the arrays are a(i) and b(i), and if each is of length
N = 512, register R2 contains the following quantity:


a(1) b(1) + a(2) b(2) + ... + a(N) b(N).


The value of the repeat counter (RC) is specified to be 511 in the instruction. 
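
The structure of this loop, in which each multiply is paired with the addition of the previous product, can be sketched in C. The function name sum_of_products is hypothetical, and the sketch assumes n >= 1. It runs the loop body n - 1 times so that exactly n products are formed: one before the loop, and one per iteration.

```c
#include <stddef.h>

/* Sum of products written in the same pipelined form as a parallel
   multiply/add pair: each pass forms the next product while adding
   the previous one to the accumulator. */
float sum_of_products(const float *a, const float *b, size_t n)
{
    float r2 = 0.0f;           /* accumulator */
    float r1 = a[0] * b[0];    /* first product, formed before the loop */
    size_t i;

    for (i = 1; i < n; i++) {
        float next = a[i] * b[i];   /* compute next product and ... */
        r2 += r1;                   /* ... accumulate the previous one */
        r1 = next;
    }
    return r2 + r1;            /* one final addition */
}
```

The final addition outside the loop is needed because the last product has been formed but not yet accumulated when the loop ends.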


Example 2-9. Loop Using Single Repeat

*  TITLE LOOP USING SINGLE REPEAT
*
        LDI     @ADDR1,AR0      ;AR0 points to array a(i)
        LDI     @ADDR2,AR1      ;AR1 points to array b(i)
*
        LDF     0.0,R2          ;Initialize R2
        MPYF3   *AR0++(1),*AR1++(1),R1  ;Compute first product
        RPTS    511             ;Repeat 512 times
        MPYF3   *AR0++(1),*AR1++(1),R1  ;Compute next product and
||      ADDF3   R1,R2,R2        ;accumulate the previous
        ADDF    R1,R2           ;One final addition
2.6 Computed GOTOs to Select Subroutines at Runtime


Occasionally, it is convenient to select the subroutine to be executed at
runtime rather than at assembly time. The ’C4x’s computed GOTO supports this
selection. You can implement the computed GOTO by using the CALLcond
instruction in the register addressing mode. This instruction uses the contents of the
register as the address of the call. Example 2-10 shows the case of a task
controller.
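
In C, the closest analogue of a call through a register is a call through a function pointer taken from a table. The sketch below is a shortened, hypothetical task controller with three tasks rather than six; all names are illustrative, and the index variable plays the role of an index register that counts down and wraps around.

```c
typedef void (*Task)(void);

int last_task = -1;                  /* records which task ran last */

void task0(void) { last_task = 0; }
void task1(void) { last_task = 1; }
void task2(void) { last_task = 2; }

/* Table of task addresses, stored in reverse order so that counting
   the index down runs the tasks in ascending order. */
Task task_table[3] = { task2, task1, task0 };

int task_index = 2;                  /* index of the next table entry */

/* Selects the task for the current cycle, updates the index for the
   next cycle, and calls the task through its address. */
void run_next_task(void)
{
    Task t = task_table[task_index];

    task_index -= 1;
    if (task_index < 0) {
        task_index = 2;              /* reinitialize after the last task */
    }
    t();
}
```

Successive calls to run_next_task execute task0, task1, task2, and then task0 again.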


Example 2-10. Computed GOTO

*  TITLE COMPUTED GOTO
*
*  TASK CONTROLLER
*
*  THIS MAIN ROUTINE CONTROLS THE ORDER OF TASK EXECUTION
*  (6 TASKS IN THE PRESENT EXAMPLE). TASK0 THROUGH TASK5 ARE
*  THE NAMES OF SUBROUTINES TO BE CALLED. THEY ARE EXECUTED
*  IN ORDER, TASK0, TASK1, ... TASK5. WHEN AN INTERRUPT
*  OCCURS, THE INTERRUPT SERVICE ROUTINE IS EXECUTED, AND THE
*  PROCESSOR CONTINUES WITH THE INSTRUCTION FOLLOWING THE
*  IDLE INSTRUCTION. THIS ROUTINE SELECTS THE APPROPRIATE
*  TASK FOR THE CURRENT CYCLE, CALLS THE TASK AS A SUBROUTINE,
*  AND BRANCHES BACK TO THE IDLE INSTRUCTION TO WAIT FOR THE
*  NEXT SAMPLE INTERRUPT WHEN THE SCHEDULED TASK HAS COMPLETED
*  EXECUTION. R0 HOLDS THE OFFSET FROM THE BASE ADDRESS OF THE
*  TASK TO BE EXECUTED. BIT 15 (SET COND BIT) OF STATUS REGISTER
*  (ST) SHOULD BE SET TO 1.
*
        LDI     5,IR0           ;Initialize IR0
        LDI     @ADDR,AR1       ;AR1 holds the base address
                                ;of the table
WAIT    IDLE                    ;Wait for the next interrupt
        ADDI    *+AR1(IR0),R1   ;Add base address to the
                                ;table entry number
        SUBI    1,IR0           ;Decrement IR0
        LDILT   5,IR0           ;If IR0<0, reinitialize it to 5
        CALLU   R1              ;Execute appropriate task
        BR      WAIT
*
TSKSEQ  .word   TASK5           ;Address of TASK5
        .word   TASK4           ;Address of TASK4
        .word   TASK3           ;Address of TASK3
        .word   TASK2           ;Address of TASK2
        .word   TASK1           ;Address of TASK1
        .word   TASK0           ;Address of TASK0
ADDR    .word   TSKSEQ
Chapter 3


Logical and Arithmetic Operations


The ’C4x instruction set supports both integer and floating-point arithmetic and
logical operations. The basic functions of such instructions can be combined
to form more complex operations. This chapter contains the following
operations examples:

- Bit manipulation
- Block moves
- Byte and half-word manipulation
- Bit-reversed addressing
- Integer and floating-point division
- Square root
- Extended-precision arithmetic
- Floating-point format conversion between IEEE and ’C4x formats
3.1 Bit Manipulation


Instructions for logical operations, such as AND, OR, NOT, ANDN, and XOR,
can be used together with shift instructions for bit manipulation. A special
instruction, TSTB, tests bits. TSTB performs the same operation as AND, but the
result of the TSTB is used only to set the condition flags and is not written
anywhere. Example 3-1 and Example 3-2 demonstrate the use of several
instructions for bit manipulation and testing.


Example 3-1. Use of TSTB for Software-Controlled Interrupt

*  TITLE USE OF TSTB FOR SOFTWARE-CONTROLLED INTERRUPT
*
*  IN THIS EXAMPLE, ALL INTERRUPTS HAVE BEEN DISABLED BY
*  RESETTING THE GIE BIT OF THE STATUS REGISTER. WHEN AN
*  INTERRUPT ARRIVES, IT IS STORED IN THE IIF REGISTER. THE
*  PRESENT EXAMPLE ACTIVATES THE INTERRUPT SERVICE ROUTINE INTR
*  WHEN IT DETECTS THAT INT2- HAS OCCURRED.
*
        TSTB    4,IIF           ;Check if bit 2 of IIF is set,
        CALLNZ  INTR            ;and, if so, call subroutine INTR


Example 3-2. Copy a Bit from One Location to Another

*  TITLE COPY A BIT FROM ONE LOCATION TO ANOTHER
*
*  BIT I OF R1 NEEDS TO BE COPIED TO BIT J OF R2. AR0 POINTS TO A LOCATION
*  HOLDING I, AND IT IS ASSUMED THAT THE NEXT MEMORY LOCATION HOLDS THE VALUE J.
*
        LDI     1,R0
        LSH     *AR0,R0         ;Shift 1 to align it with bit I
        TSTB    R1,R0           ;Test the I-th bit of R1
        BZD     CONT            ;If bit = 0, branch delayed
        LDI     1,R0
        LSH     *+AR0(1),R0     ;Align 1 with J-th location
        ANDN    R0,R2           ;If bit = 0, reset J-th bit of R2
        OR      R0,R2           ;If bit = 1, set J-th bit of R2
CONT
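
The same bit copy can be sketched in C with shifts and masks. The function name copy_bit is hypothetical, and the sketch assumes that i and j are in the range 0 to 31. As in the assembly version, the destination bit is always cleared first and is then set only if the source bit is 1.

```c
#include <stdint.h>

/* Copies bit i of r1 into bit j of r2 and returns the updated r2. */
uint32_t copy_bit(uint32_t r1, unsigned i, uint32_t r2, unsigned j)
{
    uint32_t mask_i = (uint32_t)1 << i;   /* 1 aligned with bit i */
    uint32_t mask_j = (uint32_t)1 << j;   /* 1 aligned with bit j */

    r2 &= ~mask_j;                        /* reset bit j of r2 */
    if ((r1 & mask_i) != 0u) {            /* test bit i of r1 */
        r2 |= mask_j;                     /* if it is 1, set bit j of r2 */
    }
    return r2;
}
```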
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Block Moves 


Because the ’C4x directly addresses a large amount of memory, blocks of data 
or program code can be stored off-chip in slow memories and then loaded 
on-chip for faster execution. Data can also be moved from on-chip memory to 
off-chip memory for storage or for multiprocessor data transfers. 


The DMA can transfer data efficiently in parallel with CPU operations.
Alternatively, you can use the load and store instructions in a repeat mode
to perform data transfers under program control. Example 3-3 shows how to
transfer a block of 512 floating-point numbers from external memory to
block 1 of on-chip RAM.


Example 3-3. Block Move Under Program Control


* TITLE BLOCK MOVE UNDER PROGRAM CONTROL
*
extern  .word   01000H
block1  .word   02FFC00H
        LDI     @extern,AR0     ;Source address
        LDI     @block1,AR1     ;Destination address
        LDF     *AR0++,R0       ;Load the first number
        RPTS    510             ;Repeat following instruction 511 times
        LDF     *AR0++,R0       ;Load the next number, and...
     || STF     R0,*AR1++       ;store the previous one
        STF     R0,*AR1         ;Store the last number
3.3 Byte and Half-Word Manipulation


A set of instructions for byte and half-word accessibility, such as
LB(3,2,1,0), LBU(3,2,1,0), LH(1,0), LHU(1,0), LWL(0,1,2,3), LWR(0,1,2,3),
MB(3,2,1,0), and MH(1,0), is available on the ’C4x. In an application such
as image processing, it is often important to be able to manipulate packed
data. For example, the pixels in color images are often represented by four
8-bit unsigned quantities (red, green, blue, and alpha) that are packed into
a single 32-bit word. The byte and half-word instructions make it easy to
manipulate this packed data.


Example 3-4 shows the packing of data from a half-word FIFO to 32-bit data
memory, and Example 3-5 shows the unpacking of a 32-bit data array into a
4-byte-wide data array (assuming the 32-bit data array contains four 8-bit
unsigned numbers).


Example 3-4. Use of Packing Data From Half-Word FIFO to 32-Bit Data Memory


* TITLE USE OF PACKING DATA FROM HALF-WORD FIFO
*       TO 32-BIT DATA MEMORY
*
* IN THIS EXAMPLE, EVERY TWO INPUT 16 BITS DATA HAS BEEN
* PACKED INTO ONE 32-BIT DATA MEMORY. THE LOOP SIZE
* USED HERE IS ARRAY SIZE, NOT THE INPUT DATA LENGTH.
        LDI     size-1,RC       ;Load array size
        RPTBD   PACK
        LDI     @fifo_adr,AR1   ;Load fifo address
        LDI     @array,AR2      ;Load data array address
        NOP
* >>>>>>>>>>>>>>>>              ;Loop starts here
        LWL0    *AR1,R9         ;Pack 16 LSBs
        LWL1    *AR1,R9         ;Pack 16 MSBs
PACK    STI     R9,*AR2++(1)    ;Store the data
Example 3-5. Use of Unpacking 32-Bit Data Into Four-Byte-Wide Data Array


* TITLE USE OF UNPACKING 32-BIT DATA INTO FOUR BYTE-WIDE
*       DATA ARRAY
*
* THIS EXAMPLE ASSUMES THAT THE 32-BIT DATA CONTAINS FOUR 8-BIT
* UNSIGNED DATA.
*
        LDI     size-1,RC       ;Load array size
        LDI     @input_adr,AR0  ;Load input address
        LDI     @array1,AR1     ;Load output data array 1 address
        RPTBD   UNPACK
        LDI     @array2,AR2     ;Load output data array 2 address
        LDI     @array3,AR3     ;Load output data array 3 address
        LDI     @array4,AR4     ;Load output data array 4 address
* >>>>>>>>>>>>>>>>              ;Loop starts here
        LBU0    *AR0,R8         ;Unpack first byte
        STI     R8,*AR1++(1)
        LBU1    *AR0,R8         ;Unpack second byte
        STI     R8,*AR2++(1)
        LBU2    *AR0,R8         ;Unpack third byte
        STI     R8,*AR3++(1)
        LBU3    *AR0++(1),R8    ;Unpack fourth byte
UNPACK  STI     R8,*AR4++(1)
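

The data movement intended by Example 3-4 and Example 3-5 can be modeled on
a host machine. This Python sketch is an illustration only; the function
names are hypothetical, and it assumes that byte 0 is the least significant
byte of the word and that the first half-word read lands in the 16 LSBs.

```python
def pack_halfwords(low, high):
    """Combine two 16-bit values into one 32-bit word (first value in the LSBs)."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def unpack_bytes(word):
    """Split a 32-bit word into four unsigned bytes, least significant first."""
    return [(word >> (8 * n)) & 0xFF for n in range(4)]


if __name__ == "__main__":
    word = pack_halfwords(0x3344, 0x1122)
    print(hex(word), [hex(b) for b in unpack_bytes(word)])
```

Each element of the list returned by unpack_bytes corresponds to the value
that one of the four output arrays receives per loop iteration in
Example 3-5.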
3.4 Bit-Reversed Addressing


The ’C4x can implement fast Fourier transforms (FFT) with bit-reversed
addressing. If the data to be transformed is in the correct order, the final
result of the FFT is scrambled in bit-reversed order. To recover the
frequency-domain data in the correct order, certain memory locations must be
swapped. The bit-reversed addressing mode makes swapping unnecessary: the
next time the data is accessed, the access is bit-reversed rather than
sequential. On the ’C4x, bit-reversed addressing can be implemented through
both the CPU and the DMA.


For correct CPU or DMA bit-reversed operation, the base address of
bit-reversed addressing must be located on a boundary of the size of the
table. To clarify this point, assume an FFT of size N = 2^n. When real and
imaginary data are stored in separate arrays, the n LSBs of the base address
must be zero (0), and IR0 must be initialized to 2^(n-1) (half of the FFT
size). When real and imaginary data are stored in consecutive memory
locations (Re-Im-Re-Im), the n+1 LSBs of the base address must be zero (0),
and IR0 must be equal to 2^n = N (the FFT size).


3.4.1 CPU Bit-Reversed Addressing

One auxiliary register (AR0, in this case) points to the physical location
of a data value. When you add IR0 to the auxiliary register by using
bit-reversed addressing, addresses are generated in a bit-reversed fashion
(reverse carry propagation). The largest index (IR0, in this case) for bit
reversing is 00FF FFFFh.
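

Reverse carry propagation can be illustrated with a small host-side model.
The Python sketch below is not ’C4x code; the function names are
hypothetical. It assumes that the index is a power of two and that the base
address is aligned as described above, so that the carry never leaves the
low-order bits covered by the index.

```python
def bit_reverse(x, nbits):
    """Reverse the order of the nbits low-order bits of x."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def reverse_carry_add(addr, index):
    """Add index to addr with the carry propagating from MSB toward LSB."""
    nbits = index.bit_length()          # field width covered by the index
    mask = (1 << nbits) - 1
    low = bit_reverse(addr & mask, nbits) + bit_reverse(index, nbits)
    return (addr & ~mask) | bit_reverse(low & mask, nbits)


def bit_reversed_sequence(base, index, count):
    """Addresses produced by repeatedly post-adding index in bit-reversed mode."""
    seq, addr = [], base
    for _ in range(count):
        seq.append(addr)
        addr = reverse_carry_add(addr, index)
    return seq


if __name__ == "__main__":
    # 8-point FFT, separate real/imaginary arrays: index = N/2 = 4
    print(bit_reversed_sequence(0, 4, 8))
```

For an 8-point transform with the index set to 4, the model visits the
offsets 0, 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed ordering of
0 through 7.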


Example 3-6 illustrates how to move a 512-point complex FFT from the place
of computation (pointed at by AR0) to a location pointed at by AR1. Reads
are executed in a bit-reversed fashion and writes in a linear fashion. In
this example, the real and imaginary parts XR(i) and XI(i) of the data are
not stored in separate arrays; they are interleaved as XR(0), XI(0), XR(1),
XI(1), ..., XR(N-1), XI(N-1). Because of this arrangement, the length of the
array is 2N instead of N, and IR0 is set to 512 instead of 256.



Example 3-6. CPU Bit-Reversed Addressing


*
* TITLE BIT-REVERSED ADDRESSING
*
* THIS EXAMPLE MOVES THE RESULT OF THE 512-POINT FFT COMPUTATION, POINTED AT BY
* AR0, TO A LOCATION POINTED AT BY AR1. REAL AND IMAGINARY POINTS ARE ALTERNATING.
*
        LDI     511,RC          ;Repeat 511+1 times
        RPTBD   LOOP
        LDI     512,IR0         ;Load FFT size
        LDI     2,IR1
        LDF     *+AR0(1),R1     ;Load first imaginary point
*
        LDF     *AR0++(IR0)B,R0 ;Load real value (and point to next
     || STF     R1,*+AR1(1)     ;location) and store the imaginary value
LOOP    LDF     *+AR0(1),R1     ;Load next imaginary point and store
     || STF     R0,*AR1++(IR1)  ;previous real value


3.4.2 DMA Bit-Reversed Addressing 


In DMA bit-reversed addressing, two bits in the DMA control register enable
bit-reversed addressing on DMA reads (READ BIT REV) and DMA writes
(WRITE BIT REV). The source address index register and destination address
index register define the size of the bit-reversed addressing. Their
function is similar to that of the CPU index register IR0 described in the
previous subsection. Two DMA block transfers are required when the DMA is
used for bit-reversed transfer of complex numbers: one to transfer the real
parts and one to transfer the imaginary parts.


Figure 3-1 illustrates the DMA settings required for a DMA operation equiva- 
lent to Example 3-6. Unified-autoinitialization mode and bit-reversed read are 
used. For more detailed information about DMA operation, refer to The DMA 
Coprocessor in the TMS320C4x User’s Guide. 
Figure 3-1. DMA Bit-Reversed Addressing


                    First block      Second block (at label)
Control Register    00C0 1009h       00C0 1005h
src Address         AR0              AR0+1
src Index           IR0              IR0
Counter             512
dst Address
dst Index
Link Pointer        --> label
3.5 Integer and Floating-Point Division


You can use the single-cycle instruction, RCPF, to generate an estimate of
the reciprocal of a floating-point number. This estimate has the correct
exponent, and the mantissa is accurate to the eighth binary place (the error
of the mantissa is < 2^-8). Often, this is a satisfactory estimate of the
reciprocal of a floating-point number. In other cases, this estimate can be
used as a seed for an algorithm that computes the reciprocal to even greater
accuracy. The Newton-Raphson algorithm described later is one such case.


Although the ’C4x provides no special instruction for integer division, the
instruction set can perform an efficient division routine. Additionally, the
FLOAT, RCPF, and FIX instructions can produce a rough estimate.


3.5.1 Integer Division 


You can implement division on the ’C4x by repeating SUBC, a special
conditional subtract instruction. Consider the case of a 32-bit positive
dividend with i significant bits (and 32-i sign bits), and a 32-bit positive
divisor with j significant bits (and 32-j sign bits). The repetition of the
SUBC command i-j+1 times produces a 32-bit result in which the lower i-j+1
bits are the quotient, and the upper 31-i+j bits are the remainder of the
division.


SUBC implements binary division in the same manner as long division. The
divisor (assumed to be smaller than the dividend) is shifted left i-j times
to align with the dividend. Then, using SUBC, the shifted divisor is
subtracted from the dividend. For each subtract that does not produce a
negative answer, the dividend is replaced by the difference. It is then
shifted to the left, and the LSB is set to 1. If the difference is negative,
the dividend is simply shifted left by one. This operation is repeated
i-j+1 times.


As an example, consider the division of 33 by 5 using both long division and
the SUBC method. In this case, i = 6, j = 3, and the SUBC operation is
repeated 6-3+1 = 4 times.


LONG DIVISION:

                                   00000000000000000000000000000110   Quotient
00000000000000000000000000000101 ) 00000000000000000000000000100001
                                                             -101
                                                               1101
                                                              -101
                                                                 11   Remainder


SUBC METHOD:

00000000000000000000000000100001   Dividend
00000000000000000000000000101000   Divisor (aligned)
                                   Negative difference (1st SUBC command)

00000000000000000000000001000010   New Dividend + Quotient
00000000000000000000000000101000   Divisor
00000000000000000000000000011010   Difference (>0) (2nd SUBC command)

00000000000000000000000000110101   New Dividend + Quotient
00000000000000000000000000101000   Divisor
00000000000000000000000000001101   Difference (>0) (3rd SUBC command)

00000000000000000000000000011011   New Dividend + Quotient
00000000000000000000000000101000   Divisor
                                   Negative difference (4th SUBC command)

00000000000000000000000000110110   Final Result

In the final result, the lower four bits (0110) are the quotient and the
bits above them (11) are the remainder.
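

The SUBC procedure described above can be traced with a host-side model.
The following Python sketch is an illustration only, not ’C4x code: the
function names subc and divide are hypothetical, and both operands are
assumed to be positive 32-bit integers.

```python
MASK32 = 0xFFFFFFFF


def subc(src, dst):
    """Model one conditional subtract: returns the new value of dst."""
    diff = dst - src
    if diff >= 0:
        return ((diff << 1) | 1) & MASK32   # keep difference, shift in a 1
    return (dst << 1) & MASK32              # keep dst, shift in a 0


def divide(dividend, divisor):
    """Return (quotient, remainder) using i-j+1 conditional subtracts."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("operands must be positive")
    i = dividend.bit_length()               # significant bits of the dividend
    j = divisor.bit_length()                # significant bits of the divisor
    if j > i:
        return 0, dividend
    count = i - j
    aligned = divisor << count              # align divisor with dividend
    result = dividend
    for _ in range(count + 1):
        result = subc(aligned, result)
    quotient = result & ((1 << (count + 1)) - 1)
    remainder = result >> (count + 1)
    return quotient, remainder


if __name__ == "__main__":
    print(divide(33, 5))
```

For 33 divided by 5, the intermediate values of result are 66, 53, 27, and
54, matching the four SUBC steps shown above.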


When the SUBC command is used, both the dividend and the divisor must be
positive. Example 3-7 shows a realization of the integer division in which
the sign of the quotient is properly handled. The last instruction before
returning modifies the condition flag, in case subsequent operations depend
on the sign of the result.



Example 3-7. Integer Division


*
* TITLE INTEGER DIVISION
*
* SUBROUTINE DIVI
*
* INPUTS:  SIGNED INTEGER DIVIDEND IN R0,
*          SIGNED INTEGER DIVISOR IN R1.
*
* OUTPUT:  R0/R1 into R0.
*
* REGISTERS USED: R0-R3, IR0, IR1
*
* OPERATION: 1. NORMALIZE DIVISOR WITH DIVIDEND
*            2. REPEAT SUBC
*            3. QUOTIENT IS IN LSBs OF RESULT
*
* CYCLES: 31-62 (DEPENDS ON AMOUNT OF NORMALIZATION)
*
        .globl  DIVI
SIGN    .set    R2
TEMPF   .set    R3
TEMP    .set    IR0
COUNT   .set    IR1
*
* DIVI - SIGNED DIVISION
*
DIVI:
*
* DETERMINE SIGN OF RESULT. GET ABSOLUTE VALUE OF OPERANDS.
*
        XOR     R0,R1,SIGN      ;Get the sign
        ABSI    R0
        ABSI    R1
        CMPI    R0,R1           ;Divisor > dividend ?
        BGTD    ZERO            ;If so, return 0
*
* NORMALIZE OPERANDS. USE DIFFERENCE IN EXPONENTS AS SHIFT COUNT
* FOR DIVISOR, AND AS REPEAT COUNT FOR 'SUBC'.
*
        FLOAT   R0,TEMPF        ;Normalize dividend
        PUSHF   TEMPF           ;PUSH as float
        POP     COUNT           ;POP as int
        LSH     -24,COUNT       ;Get dividend exponent
        FLOAT   R1,TEMPF        ;Normalize divisor
        PUSHF   TEMPF           ;PUSH as float
        POP     TEMP            ;POP as int
        LSH     -24,TEMP        ;Get divisor exponent
        SUBI    TEMP,COUNT      ;Get difference in exponents
        LSH     COUNT,R1        ;Align divisor with dividend
*
* DO COUNT+1 SUBTRACT & SHIFTS.
*
        RPTS    COUNT
        SUBC    R1,R0
*
* MASK OFF THE LOWER COUNT+1 BITS OF R0
*
        SUBRI   31,COUNT        ;Shift count is (32 - (COUNT+1))
        LSH     COUNT,R0        ;Shift left
        NEGI    COUNT
        LSH     COUNT,R0        ;Shift right to get result
*
* CHECK SIGN AND NEGATE RESULT IF NECESSARY.
*
        NEGI    R0,R1           ;Negate result
        ASH     -31,SIGN        ;Check sign
        LDINZ   R1,R0           ;If set, use negative result
        CMPI    0,R0            ;Set status from result
        RETS
*
* RETURN ZERO
*
ZERO
        LDI     0,R0
        RETS
        .end


If the dividend is less than the divisor and you want fractional division,
you can perform a division after you determine the desired accuracy of the
quotient in bits. If the desired accuracy is k bits, start by shifting the
dividend left by k positions. Then apply the algorithm described above,
replacing i with i + k. It is assumed that i + k is less than 32.


3.5.2 Computation of Floating-Point Inverse and Division


The RCPF (reciprocal of a floating-point number) instruction generates an
estimate of the reciprocal of a floating-point number. You can use the
Newton-Raphson algorithm to extend the precision of the mantissa of that
estimate. Floating-point division is then obtained by multiplying the
dividend by the reciprocal of the divisor.


The input to RCPF is assumed to be v = v(man) x 2^v(exp). The output is
x = x(man) x 2^x(exp). The value v(man) (or x(man)) is composed of three
fields: the sign bit v(sign), an implied nonsign bit, and the fraction field
v(frac).


Four rules apply to generating the reciprocal of a floating-point number:


1) If v > 0, then x(exp) = -v(exp) - 1, and x(man) = 2/v(man).
   For the special case in which the 10 MSBs of v(man) = 01.00000000b,
   x(man) = 2 - 2^-8 = 01.11111111b. In both cases, the 23 LSBs of
   x(frac) = 0.


2) If v < 0, then x(exp) = -v(exp) - 1, and x(man) = 2/v(man).
   For the special case in which the 10 MSBs of v(man) = 10.00000000b,
   x(man) = -1 - 2^-8 = 10.11111111b. In both cases, the 23 LSBs of
   x(frac) = 0.


3) If v = 0 (v(exp) = -128), then x(exp) = 127, and
   x(man) = 01.1111111111111111111111111111111b.
   In other words, if v = 0, then x becomes the largest positive number
   representable in the extended-precision floating-point format. The
   overflow flag (V) is set to 1.


4) If v(exp) = 127, then x(exp) = -128, and x(man) = 0.
   The zero flag (Z) is set to 1.


The Newton-Raphson algorithm is:

x[n+1] = x[n] (2.0 - v x[n])


In this algorithm, v is the number for which the reciprocal is desired. x[0]
is the seed for the algorithm and is given by RCPF. At every iteration of
the algorithm, the number of bits of accuracy in the mantissa doubles. Using
RCPF, accuracy starts at eight bits. With one iteration, accuracy increases
to 16 bits in the mantissa, and with the second iteration, accuracy
increases to 32 bits in the mantissa. Example 3-8 shows the program for
implementing this algorithm on the ’C4x.
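

The doubling of accuracy can be observed with a host-side model of the
iteration. The Python sketch below is an illustration only: seed_reciprocal
is a hypothetical stand-in that truncates the mantissa of 1/v so that the
relative error is below 2^-8; it does not reproduce the exact bit pattern
that RCPF returns.

```python
import math


def seed_reciprocal(v):
    """Stand-in for an 8-bit seed: 1/v with a truncated mantissa (assumption)."""
    m, e = math.frexp(1.0 / v)           # 1/v = m * 2**e with 0.5 <= |m| < 1
    m = math.trunc(m * 512) / 512        # keep 9 fraction bits of m
    return math.ldexp(m, e)


def reciprocal(v, iterations=2):
    """Refine the seed with x[n+1] = x[n] * (2.0 - v * x[n])."""
    x = seed_reciprocal(v)
    for _ in range(iterations):
        x = x * (2.0 - v * x)
    return x


if __name__ == "__main__":
    v = 3.0
    for n in range(3):
        print(n, abs(reciprocal(v, n) * v - 1.0))
```

Because 1 - v x[n+1] = (1 - v x[n])^2, the relative error is squared at every
step, which is why the number of accurate bits doubles.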
Example 3-8. Inverse of a Floating-Point Number With 32-Bit Mantissa Accuracy


*
* TITLE INVERSE OF A FLOATING-POINT NUMBER WITH 32-BIT
*       MANTISSA ACCURACY
*
* SUBROUTINE INVF
*
* THE FLOATING-POINT NUMBER v IS STORED IN R0. AFTER THE
* COMPUTATION IS COMPLETED, 1/v IS STORED IN R1.
*
* TYPICAL CALLING SEQUENCE:
*       LAJU    INVF
*       LDF     v,R0
*       NOP             <---- can be other non-pipeline-break
*       NOP             <---- instructions
*
* ARGUMENT ASSIGNMENTS:
*   ARGUMENT | FUNCTION
*   ---------+--------------------------------------
*   R0       | v = NUMBER TO FIND THE RECIPROCAL OF
*            | (UPON THE CALL)
*   R1       | 1/v (UPON THE RETURN)
*
* REGISTER USED AS INPUT: R0
* REGISTERS MODIFIED: R1, R2
* REGISTER CONTAINING RESULT: R1
* REGISTER FOR SUBROUTINE CALL: R11
*
* CYCLES: 7 (not including subroutine overhead)
* WORDS: 8 (not including subroutine overhead)
*
        .global INVF
*
INVF:   RCPF    R0,R1           ;Get x[0] = the
*                               ;estimate of 1/v, R0 = v
        MPYF3   R1,R0,R2
        SUBRF   2.0,R2
        MPYF    R2,R1           ;End of first iteration
*                               ;(16 bits accuracy)
        BUD     R11             ;Delayed return to caller
*
        MPYF3   R1,R0,R2
        SUBRF   2.0,R2
        MPYF    R2,R1           ;End of second iteration
*                               ;(32 bits accuracy)
*
*       R1 = 1/v, Return to caller


3.6 Calculating a Square Root 


In many applications, normalization of data values is necessary. Often, the
normalizing factor is the square root of another quantity. For example,
given a vector, the unit vector in the same direction as the original vector
can be found by normalizing the original vector by its length. This involves
a division by a square root. The ’C4x single-cycle instruction RSQRF
generates an estimate of the reciprocal of the square root of a positive
floating-point number. This estimate has the correct exponent, and the
mantissa is accurate to the eighth binary place (the error of the mantissa
is < 2^-8). Three rules apply to this instruction:


1) If v(exp) is even, then x(exp) = -(v(exp)/2) - 1, and
   x(man) = 2/sqrt(v(man)).
   For the special case where the 10 MSBs of v(man) = 01.00000000b,
   x(man) = 2 - 2^-8 = 01.11111111b. In both cases, the 23 LSBs of
   x(frac) = 0.


2) If v(exp) is odd, then x(exp) = -((v(exp) - 1)/2) - 1, and
   x(man) = sqrt(2/v(man)). The 23 LSBs of x(frac) = 0.


3) If v = 0 (v(exp) = -128), then x(exp) = 127, and
   x(man) = 01.1111111111111111111111111111111b.
   In other words, if v = 0, then x becomes the largest positive number
   representable in the extended-precision floating-point format. The
   overflow flag (V) is set to 1.


If you need more precision than the RSQRF instruction gives for the estimate
of the reciprocal of the square root, you can use the Newton-Raphson
algorithm to further extend the precision of the mantissa. The algorithm is:


x[n+1] = x[n] (1.5 - (v/2) x[n] x[n])


In this equation, v is the number for which the reciprocal of the square
root is desired. x[0] is the seed for the algorithm and is given by RSQRF.
At every iteration of the algorithm, the number of bits of accuracy in the
mantissa doubles. Using RSQRF, accuracy starts at eight bits. With one
iteration, accuracy increases to 16 bits, and with the second iteration,
accuracy increases to 32 bits in the mantissa. Example 3-9 shows the program
for implementing this algorithm on the ’C4x.
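

The same iteration can be tried on a host machine. The Python sketch below
is an illustration only: seed_rsqrt is a hypothetical stand-in that
truncates the mantissa of 1/sqrt(v) so that the relative error is below
2^-8; it is not the exact RSQRF result.

```python
import math


def seed_rsqrt(v):
    """Stand-in for an 8-bit seed of 1/sqrt(v) (assumption, not RSQRF itself)."""
    m, e = math.frexp(1.0 / math.sqrt(v))   # 0.5 <= m < 1
    m = math.floor(m * 512) / 512           # keep 9 fraction bits of m
    return math.ldexp(m, e)


def rsqrt(v, iterations=2):
    """Refine the seed with x[n+1] = x[n] * (1.5 - (v/2) * x[n] * x[n])."""
    x = seed_rsqrt(v)
    half_v = 0.5 * v
    for _ in range(iterations):
        x = x * (1.5 - half_v * x * x)
    return x


def sqrt_from_rsqrt(v):
    """Square root obtained by one extra multiplication: sqrt(v) = v * x."""
    return v * rsqrt(v)


if __name__ == "__main__":
    print(sqrt_from_rsqrt(2.0), math.sqrt(2.0))
```

In exact arithmetic, a relative error e in x[n] becomes roughly 1.5 e^2 in
x[n+1], so each iteration approximately doubles the number of accurate bits.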
Example 3-9. Reciprocal of the Square Root of a Positive Floating-Point Number


*
* TITLE RECIPROCAL OF THE SQUARE ROOT OF A POSITIVE
*       FLOATING-POINT NUMBER
*
* SUBROUTINE RCPSQRF
*
* THE FLOATING-POINT NUMBER v IS STORED IN R0. AFTER THE
* COMPUTATION IS COMPLETED, 1/SQRT(v) IS STORED IN R1.
*
* TYPICAL CALLING SEQUENCE:
*       LDF     v,R0
*       LAJU    RCPSQRF
*
* ARGUMENT ASSIGNMENTS:
*   ARGUMENT | FUNCTION
*   ---------+--------------------------------------
*   R0       | v = NUMBER TO FIND THE RECIPROCAL OF
*            | (UPON THE CALL)
*   R1       | 1/sqrt(v) (UPON THE RETURN)
*
* REGISTER USED AS INPUT: R0
* REGISTERS MODIFIED: R1, R2
* REGISTER CONTAINING RESULT: R1
* REGISTER FOR SUBROUTINE CALL: R11
*
* CYCLES: 10 (not including subroutine overhead)
* WORDS: 10 (not including subroutine overhead)
*
        .global RCPSQRF
*
RCPSQRF: RSQRF  R0,R1           ;Get x[0] = the estimate of 1/sqrt(v), R0 = v
        MPYF    0.5,R0          ;R0 = v/2
*
        MPYF3   R1,R1,R2        ;First iteration
        MPYF    R0,R2
        SUBRF   1.5,R2
        MPYF    R2,R1           ;End of first iteration (16 bits accuracy)
*
        MPYF3   R1,R1,R2        ;Second iteration
*
        BUD     R11             ;Delayed return to caller
*
        MPYF    R0,R2
        SUBRF   1.5,R2
        MPYF    R2,R1           ;End of second iteration (32 bits accuracy)
*
*       R1 = 1/SQRT(v), Return to caller
*
        .end


You can find the square root by a simple multiplication: sqrt(v) = v x[n],
in which x[n] is the estimate of 1/sqrt(v) as determined by the
Newton-Raphson algorithm or another algorithm.
3.7 Extended-Precision Arithmetic


The ’C4x offers 32 bits of precision for integer arithmetic and 24 bits of
precision in the mantissa for floating-point arithmetic. For higher
precision in floating-point operations, the twelve extended-precision
registers, R0 to R11, contain eight more bits of accuracy. Because no
comparable extension is available for fixed-point arithmetic, this section
discusses how to achieve fixed-point double precision. The technique
consists of performing the arithmetic by parts and is similar to the way in
which longhand arithmetic is done.


The instructions ADDC (add with carry) and SUBB (subtract with borrow) use
the status carry bit for extended-precision arithmetic. The carry bit is
affected by the arithmetic operations of the ALU and by the rotate and shift
instructions. You can also manipulate it directly by setting the status
register to certain values. For proper operation, the overflow mode bit
should be reset (OVM = 0) so that the results are not loaded with the
saturation values. Example 3-10 and Example 3-11 show 64-bit addition and
64-bit subtraction, respectively. The first operand is stored in the
registers R0 (low word) and R1 (high word). The second operand is stored in
registers R2 and R3, respectively. The result is stored in R0 and R1.
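

Arithmetic by parts can be modeled on a host machine. The Python sketch
below is an illustration only, with hypothetical function names; each 64-bit
value is passed as a (low word, high word) pair of unsigned 32-bit integers,
and saturation is not modeled (OVM = 0 is assumed).

```python
MASK32 = 0xFFFFFFFF


def add64(x_lo, x_hi, y_lo, y_hi):
    """64-bit add by parts: low words first, then high words plus carry."""
    lo = x_lo + y_lo                        # models ADDI on the low words
    carry = lo >> 32                        # carry out of bit 31
    hi = (x_hi + y_hi + carry) & MASK32     # models ADDC on the high words
    return lo & MASK32, hi


def sub64(x_lo, x_hi, y_lo, y_hi):
    """64-bit subtract by parts: low words first, then high words minus borrow."""
    lo = x_lo - y_lo                        # models SUBI on the low words
    borrow = 1 if lo < 0 else 0             # borrow out of bit 31
    hi = (x_hi - y_hi - borrow) & MASK32    # models SUBB on the high words
    return lo & MASK32, hi


if __name__ == "__main__":
    print([hex(w) for w in add64(0xFFFFFFFF, 0x0, 0x1, 0x0)])
    print([hex(w) for w in sub64(0x0, 0x1, 0x1, 0x0)])
```

The first call carries out of the low word into the high word; the second
borrows from the high word.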


Example 3-10. 64-Bit Addition


*
* TITLE 64-BIT ADDITION
*
* TWO 64-BIT NUMBERS ARE ADDED TO EACH OTHER PRODUCING
* A 64-BIT RESULT. THE NUMBERS X (R1,R0) AND Y (R3,R2)
* ARE ADDED, RESULTING IN W (R1,R0).
*
*         R1 R0
*       + R3 R2
*       -------
*         R1 R0
*
        ADDI    R2,R0
        ADDC    R3,R1
Example 3-11. 64-Bit Subtraction


*
* TITLE 64-BIT SUBTRACTION
*
* TWO 64-BIT NUMBERS ARE SUBTRACTED FROM EACH OTHER
* PRODUCING A 64-BIT RESULT. THE NUMBERS X (R1,R0) AND
* Y (R3,R2) ARE SUBTRACTED, RESULTING IN W (R1,R0).
*
*         R1 R0
*       - R3 R2
*       -------
*         R1 R0
*
        SUBI    R2,R0
        SUBB    R3,R1


When two 32-bit numbers are multiplied, a 64-bit product results. To do
this, the ’C4x provides a 32-bit x 32-bit multiplier and two special
instructions, MPYSHI (multiply signed integer and produce 32 MSBs) and
MPYUHI (multiply unsigned integer and produce 32 MSBs). Example 3-12 shows
the implementation of a 32-bit x 32-bit multiplication.


Example 3-12. 32-Bit by 32-Bit Multiplication


*
* TITLE 32-BIT X 32-BIT MULTIPLICATION
*
* MULTIPLIES 2 32-BIT NUMBERS, PRODUCING A 64-BIT RESULT.
* THE TWO NUMBERS R0 AND R1 ARE MULTIPLIED, RESULTING
* IN W (R3,R2).
*
*            R0
*       x    R1
*       -------
*         R3 R2
*
        MPYI3   R0,R1,R2
        MPYSHI3 R0,R1,R3
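

The split of the 64-bit product into two 32-bit words can be modeled on a
host machine. The Python sketch below is an illustration only; the function
names are hypothetical, the operands are treated as signed 32-bit integers,
and saturation is not modeled (OVM = 0 is assumed).

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul32x32(a, b):
    """Return (low, high) 32-bit words of the signed 64-bit product a * b."""
    product = to_signed32(a) * to_signed32(b)
    low = product & MASK32              # the 32 LSBs, as kept in R2
    high = (product >> 32) & MASK32     # the 32 MSBs, as kept in R3
    return low, high


if __name__ == "__main__":
    print([hex(w) for w in mul32x32(0x00010000, 0x00010000)])
```

Multiplying 2^16 by 2^16 gives 2^32, so the low word is zero and the high
word is 1.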
3.8 Floating-Point Format Conversion: IEEE to/From ’C4x


In fixed-point arithmetic, the binary point that separates the integer from
the fractional part of the number is fixed at a certain location. Therefore,
if the binary point of a 32-bit number is fixed after the most significant
bit (which is also the sign bit), only a fractional number (a number with an
absolute value less than 1) can be represented. In other words, there is a
number with 31 fractional bits. All operations assume that the binary point
is fixed at this location. The fixed-point system, although simple to
implement in hardware, imposes limitations on the dynamic range of the
represented number. This causes scaling problems in many applications. You
can avoid this difficulty by using floating-point numbers.


A floating-point number consists of a mantissa m multiplied by base b raised
to an exponent e:


m x b^e


In current hardware implementations, the mantissa is typically a normalized
number with an absolute value between 1 and 2, and the base is b = 2.
Although the mantissa is represented as a fixed-point number, the actual
value of the overall number floats the binary point because of the
multiplication by b^e. The exponent e is an integer whose value determines
the position of the binary point in the number. IEEE has established a
standard format for the representation of floating-point numbers.


To achieve higher efficiency in the hardware implementation, the ’C4x uses a
floating-point format that differs from the IEEE standard. However, the ’C4x
has two single-cycle instructions, TOIEEE and FRIEEE, for the format
conversion. These two instructions can also be used in parallel with the STF
instruction, which allows the data format to be converted within a
memory-to-memory transfer. Here are descriptions of both formats and an
example program to convert between them.


’C4x floating-point format:

   8 bits    1     23 bits
 +--------+---+------------+
 |   e    | s |     f      |
 +--------+---+------------+


In a 32-bit word representing a floating-point number, the first 8 bits
correspond to the exponent expressed in twos-complement format. One bit is
for sign, and 23 bits are for the mantissa. The mantissa is expressed in
twos-complement form with the binary point after the most significant
nonsign bit. Because this bit is the complement of the sign bit s, it is
suppressed. In other words, the mantissa actually has 24 bits. One special
case occurs when e = -128. In this case, the number is interpreted as zero,
independently of the values of s and f (which are, by default, set to zero).
To summarize, the values of the represented numbers in the ’C4x
floating-point format are as follows:


2^e x (01.f)    if s = 0
2^e x (10.f)    if s = 1
0               if e = -128
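

The three cases above can be checked with a host-side decoder. The Python
sketch below is an illustration only; the function name c4x_to_float is
hypothetical, and it handles only the 32-bit single-precision layout
described in this section.

```python
def c4x_to_float(word):
    """Decode a 32-bit word laid out as e (8 bits), s (1 bit), f (23 bits)."""
    e = (word >> 24) & 0xFF
    if e >= 128:
        e -= 256                        # exponent is two's complement
    s = (word >> 23) & 1
    f = word & 0x7FFFFF
    if e == -128:
        return 0.0                      # special case: zero
    # 01.f is 1 + fraction; 10.f is -2 + fraction (two's-complement mantissa)
    mantissa = (-2.0 if s else 1.0) + f / float(1 << 23)
    return mantissa * 2.0 ** e


if __name__ == "__main__":
    for w in (0x00000000, 0xFF800000, 0x01000000, 0x80000000):
        print(hex(w), c4x_to_float(w))
```

Note that -1.0 is encoded with e = -1 and an all-zero fraction, because the
mantissa 10.0b has the value -2.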


IEEE floating-point format:

   1    8 bits     23 bits
 +---+--------+------------+
 | s |   e    |     f      |
 +---+--------+------------+


The IEEE floating-point format uses sign-magnitude notation for the mantissa. 
In a 32-bit word representing a floating-point number, the first bit is the sign bit. 
The next 8 bits correspond to the exponent, expressed in an offset-by-127 for- 
mat (the actual exponent is e-127). The following 23 bits represent the abso- 
lute value of the mantissa with the most significant 1 implied. The binary point 
is fixed after this most significant 1. In other words, the mantissa actually has 
24 bits. Several special cases are summarized below. 


These are the values of the represented numbers in the IEEE floating-point
format:


(-1)^s x 2^(e-127) x (01.f)     if 0 < e < 255

Special cases:

(-1)^s x 0.0                    if e = 0 and f = 0 (zero)
(-1)^s x 2^(-126) x (0.f)       if e = 0 and f <> 0 (denormalized)
(-1)^s x infinity               if e = 255 and f = 0 (infinity)
NaN (not a number)              if e = 255 and f <> 0


The ’C4x performs the conversion according to these definitions of the
formats. It assumes that the source data for the IEEE format is in memory
only and that the source data for the ’C4x floating-point format is in
either memory or an extended-precision register. The destination for both
conversions must be an extended-precision register. In the case of a block
memory transfer, the no-penalty data-format conversion can be executed by a
parallel instruction with STF. Example 3-13 and Example 3-14 show the
data-format conversion within a data transfer between a communication port
and internal RAM.
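

The relationship between the two formats can be explored with a host-side
model. The Python sketch below is an illustration only and does not claim to
reproduce the hardware conversion: the function name ieee_to_c4x is
hypothetical, only zero and normalized IEEE single-precision inputs are
handled, and denormals, infinities, and NaNs are rejected rather than
modeled.

```python
import struct


def ieee_to_c4x(value):
    """Return the 32-bit word (e, s, f fields) for a normalized float."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 0 and f == 0:
        return 0x80000000               # zero: exponent -128, s = f = 0
    if e == 0 or e == 255:
        raise ValueError("denormals, infinities, and NaNs are not modeled")
    exp = e - 127                       # remove the IEEE offset
    if s and f == 0:
        exp -= 1                        # -(1.0) x 2^exp = (10.0b) x 2^(exp-1)
        frac = 0
    elif s:
        frac = (1 << 23) - f            # -(1.f) = -2 + (1 - 0.f)
    else:
        frac = f                        # positive mantissas are identical
    return ((exp & 0xFF) << 24) | (s << 23) | frac


if __name__ == "__main__":
    for v in (1.0, -1.0, 2.0, -3.5, 0.0):
        print(v, hex(ieee_to_c4x(v)))
```

Positive values differ only in the exponent encoding and field order, while
negative values also require converting the sign-magnitude mantissa to
twos-complement form.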
Example 3-13. IEEE to ’C4x Conversion Within Block Memory Transfer


*
* TITLE IEEE TO ’C4x CONVERSION WITHIN BLOCK MEMORY
*       TRANSFER
*
* PROGRAM ASSUMES THAT INPUT FIFO OF COMMUNICATION PORT 0
* IS FULL OF IEEE FORMAT DATA. EIGHT DATA WORDS ARE
* TRANSFERRED FROM COMMUNICATION PORT 0 TO INTERNAL RAM
* BLOCK 0 AND THE DATA FORMAT IS CONVERTED FROM IEEE FORMAT
* TO ’C4x FLOATING-POINT FORMAT.
*
        LDI     @CP0_IN,AR0     ;Load comm port0 input FIFO address
        LDI     @RAM0,AR1       ;Load internal RAM block 0 address
        FRIEEE  *AR0,R0         ;Convert first data
        RPTS    6
        FRIEEE  *AR0,R0         ;Convert next data
     || STF     R0,*AR1++(1)    ;Store previous data
        STF     R0,*AR1++(1)    ;Store last data


Example 3-14. ’C4x to IEEE Conversion Within Block Memory Transfer


*
* TITLE ’C4x TO IEEE CONVERSION WITHIN BLOCK MEMORY
*       TRANSFER
*
* PROGRAM ASSUMES THAT OUTPUT FIFO OF COMMUNICATION PORT 0
* IS EMPTY. EIGHT DATA WORDS ARE TRANSFERRED FROM INTERNAL
* RAM BLOCK 0 TO COMMUNICATION PORT 0 AND THE DATA FORMAT
* IS CONVERTED FROM ’C4x FLOATING-POINT FORMAT TO
* IEEE FORMAT.
*
        LDI     @CP0_OUT,AR0    ;Load comm port0 output FIFO address
        LDI     @RAM0,AR1       ;Load internal RAM block 0 address
        TOIEEE  *AR1++(1),R0    ;Convert first data
        RPTS    6
        TOIEEE  *AR1++(1),R0    ;Convert next data
     || STF     R0,*AR0         ;Store previous data
        STF     R0,*AR0         ;Store last data
Chapter 4


Memory Interfacing 


The ’C4x’s advanced interface design can be used to implement a wide variety
of system configurations. Its two external buses and DMA capability provide
a flexible parallel 32-bit interface to byte- or word-wide devices.


This chapter describes how to use the ’C4x’s memory interfaces to connect to 
various external devices. Specific discussions include implementation of a 
parallel interface to devices with and without wait states and implementing 
system control functions. 


4.1 System Configuration


Figure 4-1 illustrates an expanded configuration of a ’C4x system with
different types of external devices and the interfaces to which they are
connected.


Figure 4-1. Possible System Configurations

[Block diagram. Labels: ’C4x; local bus; global bus; interrupt interface;
communication ports; external flags; timer interface; system control; fast
local memory; large shared memory; analog I/O; bit I/O; peripherals; I/O
devices; ’C4x devices; clock, reset generator, etc.]


In your design, you can use any subset or superset of the illustrated
components.
4.2 External Interfacing


The ’C4x interfaces connect to a wide variety of device types. Each of these
interfaces is tailored to a particular type of device such as memory, DMA,
parallel and serial peripherals, and I/O. In addition, ’C4x devices can
interface directly with each other, without external logic, through their
communication ports or their external flag pins IIOF(0-3). Each interface
comprises one or more signal lines, which transfer information and control
its operation. Figure 4-2 shows the signal groups for these interfaces.


Figure 4-2. External Interfaces

[Signal diagram. Legible signal names: LSTRB1, IACK, RESET, RESETLOC(1,0),
ROMEN, X1, X2/CLKIN, H1, H3, CSTRBn, CRDYn, TCLK0 and TCLK1 (timer interface
and I/O flags), TCK, TDO, TDI.]

Note: n = 0 for communication port 0, n = 1 for communication port 1, etc.


The global and local buses implement the primary memory-mapped interfaces 
to the device. These interfaces allow external devices such as DMA controllers 
and other microprocessors to share resources with one or more ’C4x devices 
through a common bus. 



4.3 Global and Local Bus Interfaces


The ’C4x uses the global and local buses to access the majority of its 
memory-mapped locations. Since these two memory interfaces are identical 
in every way, except for their positions in the memory map, each example in 
this memory interface section focuses on only one of the two interfaces. How- 
ever, all of the examples are applicable to either the local or global bus. The 
buses have identical but mutually exclusive sets of control signals: 


Table 4-1. Local/Global Bus Control Signals


Global Bus      Local Bus
STRB0           LSTRB0
STRB1           LSTRB1
CE0             LCE0
CE1             LCE1
RDY0            LRDY0
RDY1            LRDY1
AE              LAE
DE              LDE
PAGE0           LPAGE0
PAGE1           LPAGE1
R/W0            LR/W0
R/W1            LR/W1


While both the global bus and the local bus can interface to a wide variety of 
devices, they most commonly interface to memories. 
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4.4 Zero Wait-State Interfacing to RAMs 


A memory-read access time is normally defined as the time between address 
valid and data valid. This time can be determined by: 


$$\text{Read access time} = t_{c(H)} - \left( t_{d(H1L-A)} + t_{su(D)R} \right)$$


where:

$t_{c(H)}$ = H1/H3 cycle time

$t_{d(H1L-A)}$ = H1 low to address valid

$t_{su(D)R}$ = Data valid before next H1 low (read)


For a full-speed, zero wait-state interface to any device, a 50-MHz ’C4x (40-ns
instruction cycle time) requires a read access time of 21 ns from address
stable to data valid. For most memories, the access time from chip enable is
the same as the access time from address; thus, it is possible to use 20-ns
memories at full speed with a 50-MHz ’C4x. However, to use 20-ns memories
properly, you must avoid long delays between the processor and the memories.
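
As a rough illustration of the calculation above, the following C sketch
evaluates the read access time and the margin left for a given memory. Only
the 40-ns cycle time, the 21-ns result, and the 20-ns memory come from the
text. The function names and the 9-ns/10-ns split between the two timing
parameters are assumptions chosen for illustration; take the actual values
from the data sheet.

```c
/* Read access time available to the memory, in ns:
 * t_c(H) - (t_d(H1L-A) + t_su(D)R). */
int read_access_time_ns(int tc_h, int td_h1l_a, int tsu_d_r)
{
    return tc_h - (td_h1l_a + tsu_d_r);
}

/* Margin left for interconnect and gating delays, in ns.
 * A negative result means the memory is too slow for zero wait states. */
int read_margin_ns(int available_ns, int memory_access_ns)
{
    return available_ns - memory_access_ns;
}
```

With a 40-ns cycle and the assumed 9-ns and 10-ns parameters, the available
time is 21 ns, so a 20-ns memory leaves 1 ns of margin and a 25-ns memory
leaves a negative margin.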


Avoiding these delays is not always possible, because interconnections and
gating for chip-enable generation can cause them. In addition, if you choose
a memory device with an output enable, the output enable must become active
quickly enough to ensure that the memory can meet the data valid timing
requirements of the ’C4x. For memories with 20-ns access times, the output
enable active to data valid timing parameter is typically less than 10 ns.


Currently available RAMs without output-enable (OE) control lines include the 
1-bit wide organized RAMs and most of the 4-bit wide RAMs. Those with OE 
controls include the byte-wide and a few of the 4-bit wide RAMs. Many of the 
fastest RAMs do not provide OE control; they use chip-enable (CE) controlled 
write cycles to ensure that data outputs do not turn on for write operations. In 
CE-controlled write cycles, the write control line (WE) goes low before CE goes 
low, and internal logic holds the outputs disabled until the cycle is completed. 
Using CE-controlled write cycles is an efficient way to interface fast RAMs 
without OE controls to the ’C4x at full speed. 


Note: You can find timing parameters for CLKIN, H1, H3, and memory in the
TMS320C40 and TMS320C44 data sheets.


4.4.1 Consecutive Reads Followed by a Write Interface Timing


Figure 4-3 shows the timing of consecutive reads followed by a write. For
consecutive reads, LSTRB0 stays active (low), and LR/W stays high as long as
read cycles continue. For back-to-back reads, the ’C4x requires
zero-wait-state memories to have an address-valid to data-valid time of less
than 21 ns.


For most memory devices, this time is the same as the memory access time,
which is t1 = 20 ns. Thus, memories with access times of 25 ns or more cannot
meet this timing.


Memory device timing is not as critical for zero-wait-state as for
nonzero-wait-state write cycles, because of the two H1 cycle writes of the
’C4x. The extra cycle gives LSTRB0 enough time to frame LR/W, preventing
memories that go into high impedance slowly at the end of a read cycle from
driving the bus during the subsequent write cycle. For the memory device used
in this design (Figure 4-3), the data lines are guaranteed to go into high
impedance (t2 = 10 ns) after CS goes inactive, which gives more than 23 ns of
margin before the ’C4x starts driving the bus with write data. Also, the extra
cycle with LSTRB0 inactive prevents writes to random locations in memory while
the address is changing between consecutive writes.


For the write cycles shown in Figure 4-3 and Figure 4-4, the RAM requires
15 ns of write data setup before CS goes high, and this design provides at
least 24 ns (t3). A data hold time of 0 ns (t4) is required by the RAM, and
this design provides greater than 13 ns. Finally, the RAM’s 20-ns setup and
0-ns hold times for address (with respect to CS high) ensure a clear margin.


Figure 4-3. Consecutive Reads Followed by a Write

[Figure: timing diagram not recoverable from the source.]


Figure 4-4. Consecutive Writes Followed by a Read

[Figure: timing diagram showing LR/W0, LSTRB0, LD(31-0) (valid write data,
valid write data, valid data), and LA(30-0) (valid write address, valid write
address, read address), with intervals t1 and t3 marked.]


4.4.2 Consecutive Writes Followed by a Read Interface Timing 


Figure 4-4 shows the timing of consecutive writes followed by a read. Notice
that between consecutive writes, LR/W stays low, but LSTRB0 goes inactive to
frame the write cycles. Although ’C4x zero-wait-state writes take two H1
cycles, writes appear to take one cycle internally (from the perspective of the
CPU and DMA) if no access to the interface is already in progress.


In the read cycle following the writes in Figure 4-4, the ’C4x requires
zero-wait-state memories to have an LSTRB-active to data-valid time of less
than 21 ns (one H1 cycle minus (H1 low to LSTRB active plus data setup before
H1 low)). For most memory devices, this time is the same as the memory access
time, which is t1 = 20 ns in this design. Thus, a margin of only 1 ns exists,
leaving little allowance for LSTRB gating if desired.


4.4.3. RAM Interface Using One Local Strobe 


Figure 4-5 shows the ’C4x’s local bus interfaced to eight Integrated Device
Technology IDT71258 20-ns 64K x 4-bit CMOS static RAMs with zero wait
states using chip-enable controlled write cycles. The SRAMs are arranged to
implement the first 64K 32-bit words in external memory, located at addresses
00000h through 0FFFFh (internal ROM is assumed to be disabled). If these 64K
words of SRAM are the only memory controlled by LSTRB0, the LSTRB ACTIVE
field of the local memory interface control register (LMICR) should be set
to its minimum value of 01111₂, allowing LSTRB0 to be active for only the
first 64K words of the ’C4x’s memory space. In addition, if this memory is the
only memory interfaced to LSTRB0, LSTRB0 requires only one page, and the
PAGESIZE field of the LMICR should be set to 01111₂. Also note that in
Figure 4-5, the LRDY0 input is tied low, selecting zero wait states for all
LSTRB0 accesses on the local bus. With all of the zero-wait-state memory
controlled by LSTRB0, LSTRB1 can be used to control accesses to slower
read-only memory devices or other types of memory.
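
The field values quoted above suggest a simple encoding: 01111₂ (15) selects
64K (2^16) words, and the page-switching example later in this chapter uses
0Ch (12) for 8K (2^13) words, so a field value of log2(size) - 1 appears to
apply. The following C sketch builds on that inference. The helper names are
hypothetical, only the STRB ACTIVE position (bits 24-28) is stated in this
chapter, and the STRB0 PAGESIZE position used here is an assumption that you
should check against the TMS320C4x User's Guide.

```c
#include <stdint.h>

#define FIELD_MASK           0x1Fu /* both fields are written as five bits */
#define STRB_ACTIVE_SHIFT    24u   /* bits 24-28, as stated in the text */
#define STRB0_PAGESIZE_SHIFT 14u   /* ASSUMPTION: not given in this chapter */

/* Field value for a region of 2^log2_words words (inferred encoding). */
uint32_t size_to_field(unsigned log2_words)
{
    return (uint32_t)(log2_words - 1u) & FIELD_MASK;
}

/* Return reg with the five-bit field at 'shift' replaced by 'value'. */
uint32_t set_field(uint32_t reg, unsigned shift, uint32_t value)
{
    uint32_t mask = (uint32_t)FIELD_MASK << shift;
    return (reg & ~mask) | ((value << shift) & mask);
}

/* Control-word bits for one strobe covering 64K words as a single page. */
uint32_t lmicr_64k_single_page(uint32_t reg)
{
    uint32_t field = size_to_field(16u);   /* 64K = 2^16 -> 01111b */
    reg = set_field(reg, STRB_ACTIVE_SHIFT, field);
    reg = set_field(reg, STRB0_PAGESIZE_SHIFT, field);
    return reg;
}
```

Keeping the shift positions as named constants makes it easy to correct the
assumed PAGESIZE position without touching the rest of the code.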


Figure 4-5. ’C4x Interface to Eight Zero-Wait-State SRAMs

[Figure: the ’C4x local bus connected to the IDT71258 SRAMs.]


In this circuit implementation, no external logic is necessary to interface the 
’C4x to the memory device. Typically, memory devices must be held inactive 
(CS inactive) during changes in WE; this avoids undesired memory accesses 
while the address changes. The ’C4x ensures this glueless interface because 
LSTRB always frames changes in LR/W. 


4.4.4 RAM Interface Using Both Local Strobes 


Figure 4-6 shows the ’C4x’s local bus interfaced to HM6708 20-ns 64K x
4-bit CMOS static RAMs with zero wait states using CS-controlled write cycles.
These RAMs are arranged to allow 128K 32-bit words of local memory, which
are implemented as two 64K x 32-bit banks. One bank is controlled by each
of the two sets of control signals on the local bus. To map these memory
devices properly in the ’C4x’s memory space, you must use the
local-memory-interface control register (LMICR) to define which part of the
local bus’s memory space is mapped to each of the two strobes. In this
implementation with internal ROM disabled, LSTRB0 is mapped to the first 64K
words of the local space (addresses 0h through 0FFFFh), and LSTRB1 is mapped
to the rest of the local space (addresses 10000h through 7FFF FFFFh). For this
memory configuration, the LSTRB ACTIVE field of the LMICR should be set to
01111₂. Also, each LSTRB requires only one page, so the PAGESIZE field of the
LMICR should be set to 01111₂. Note that in Figure 4-6, the LRDY inputs are
tied low, selecting zero wait states for all accesses on the local bus.


Hence, through the use of the ’C4x’s four strobes (two each on the local and
global buses), four different banks of memory can be decoded. In addition,
you can change the address decoding under program control by changing the
LSTRB ACTIVE field (bits 24-28) of the LMICR or the corresponding field of the
global-memory-interface control register (GMICR). If you must decode more than
four banks of memory, or if the chosen memory device cannot meet the read
cycle timing requirements for the ’C4x at zero wait states, you should use
page switching (discussed in subsection 4.5.6) to add an extra cycle to read
accesses outside the current bank boundary.


Figure 4-6. ’C4x Interface to Zero-Wait-State SRAMs, Two Strobes

[Figure: the ’C4x local bus connected to two banks of 8 x HM6708 SRAM.]


4.5 Wait States and Ready Generation


Using wait states can greatly increase a system’s flexibility and reduce its 
hardware requirement. The ’C4x is capable of generating wait states on either 
the global bus or the local bus, and both buses have independent sets of ready 
control logic. The buses’ wait-state configuration is determined by the SWW 
and WTCNT fields of the local and global-bus-interface control registers. 


This section discusses ready generation from the perspective of the global-bus
interface; however, wait-state operation on the local bus is the same as
on the global bus, so this discussion pertains equally well to both. Also, the
local and global buses each have two sets of control signals (R/W0, STRB0,
RDY0, PAGE0, CE0 and R/W1, STRB1, RDY1, PAGE1, CE1), with each set having its
own ready signal, which provides more flexibility in supporting external
devices with different speeds. Since both strobes’ ready signals share the
same electrical characteristics, the following discussion focuses on one of
the global bus’s sets of control signals.


Wait states are generated by:

- The internal wait-state generator
- The external ready inputs (RDY0 or RDY1)
- The logical AND or OR of the two ready signals


When enabled, internally generated wait states affect all external cycles, re- 
gardless of the address accessed. If different numbers of wait states are re- 
quired for various external devices, the external RDY input can be used to cus- 
tomize wait-state generation to specific system requirements. 


If the logical OR (electrical AND, since the signals are true low) of the
external and wait-count ready signals is selected, the earlier of the two
signals generates a ready condition and allows the cycle to be completed. It
is not required that both signals be present.


4.5.1 ORing of the Ready Signals (STRBx SWW = 10)


You can use the OR of the two ready signals to implement wait states for de- 
vices that require more wait states than internal logic can implement (up to 
seven). This feature is useful, for example, if a system contains some fast and 
some slow devices. In this case: 


- Fast devices can generate ready externally with a minimum of logic.
  When fast devices are accessed, the external hardware responds promptly
  with ready, which terminates the cycle.

- Slow devices can use the internal wait counter for larger numbers of wait
  states. When slow devices are accessed, the external hardware does not
  respond, and the cycle is appropriately terminated after the internal wait
  count.


The OR of the two ready signals can also terminate the bus cycle before the 
number of wait states implemented with external logic allows termination. In 
this case, a shorter wait count is specified internally than the number of wait 
states implemented with the external ready logic, and the bus cycle is termi- 
nated after the wait count. Also, this feature can be used as a safeguard 
against inadvertent accesses to nonexistent memory that would never re- 
spond with ready and would, therefore, lock up the ’C4x. 


If the OR of the two ready signals is used, however, and the internal wait-state
count is less than the number of wait states implemented externally, the
external ready generation logic must be able to reset its sequencing to allow
a new cycle to begin immediately following the end of the internal wait count.
Also, the consecutive cycles must be from independently decoded areas of
memory (or from different pages in memory). Otherwise, the external ready
generation logic may lose synchronization with bus cycles and generate
improperly timed wait states.


4.5.2 ANDing of the Ready Signals (STRBx SWW = 11)


If the logical AND (electrical OR) of the wait count and external ready signals
is selected, the later of the two signals will control the internal ready
signal, but both signals must be asserted. Accordingly, external ready control
must be implemented for each wait-state device, and the wait count ready
signal must be enabled.


This feature is useful if devices in a system are equipped to provide a ready
signal but cannot respond quickly enough to meet the ’C4x’s timing
requirements. If these devices normally indicate a ready condition and, when
accessed, respond with a wait until they become ready, the logical AND of the
two ready signals can be used to save hardware in the system. In this case,
the internal wait counter can provide wait states initially, and then the
external ready can provide wait states after the external device has had time
to send a not-ready indication. The internal wait counter then remains ready
until the external device also becomes ready, which terminates the cycle.


Additionally, the AND of the two ready signals can be used for extending the 
number of wait states for devices that already have external ready logic imple- 
mented, but require additional wait states under certain unique circumstances. 
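
The difference between the two combining modes can be sketched in C as
follows. This is only a behavioral model under stated assumptions: the type
and function names are hypothetical, each ready source is reduced to the
number of wait states after which it would signal ready, and EXTERNAL_NEVER
is an illustrative marker for external logic that never responds. The OR
mode completes on the earlier signal, and the AND mode on the later one, as
described in subsections 4.5.1 and 4.5.2.

```c
#include <limits.h>

/* Combining modes named in the text; values are the SWW bit patterns. */
typedef enum {
    SWW_OR  = 2,  /* STRBx SWW = 10: earlier of the two signals wins */
    SWW_AND = 3   /* STRBx SWW = 11: later of the two signals wins */
} sww_mode;

/* Illustrative marker: the external logic never asserts ready. */
#define EXTERNAL_NEVER UINT_MAX

/* Wait states inserted before the bus cycle completes.
 * internal_ws: internal wait count.
 * external_ws: wait states before the external ready is asserted. */
unsigned effective_wait_states(sww_mode mode,
                               unsigned internal_ws,
                               unsigned external_ws)
{
    if (mode == SWW_OR)
        return internal_ws < external_ws ? internal_ws : external_ws;
    return internal_ws > external_ws ? internal_ws : external_ws;
}
```

In this model, an OR-mode access to nonexistent memory (EXTERNAL_NEVER)
still terminates after the internal count, which is the safeguard mentioned
in subsection 4.5.1.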


4.5.3 External Ready Generation 


The optimum technique for implementing external ready generation hardware
depends on the specific characteristics of the system, including the relative
number of wait-state and nonwait-state devices in the system and the
maximum number of wait states required for any one device. The approaches
discussed here are intended to be general enough for most applications and
can easily be modified to accommodate many different system configurations.


In general, ready generation involves the following three functions:

1) Segmentation of the address space to distinguish fast and slow devices
2) Generation of properly timed ready indications
3) Logical ORing of all the separate ready timing signals together to
   connect to the physical ready input


Segmentation of the address space is required to obtain a unique indication
of each particular area within the address space that requires wait states. This
segmentation is commonly implemented in the form of chip-select generation.
Chip-select signals can initiate wait states in many cases; however,
occasionally, chip-select decoding considerations may provide signals that do
not allow ready input timing requirements to be met. In this case, you can
segment the address space coarsely on the basis of a small number of address
lines, where simpler gating allows signals to be generated more quickly. In
either case, the signal that indicates that a particular area of memory is
being addressed also normally initiates the ready or wait-state signal.


When the address space to be accessed has been established, a timing circuit
is normally used to provide a ready indication to the processor at the
appropriate point in the cycle to satisfy each device’s unique requirements.


Finally, since indications of ready status from multiple devices are typically
present, you should logically OR the signals by using a single gate to drive the
RDY input.


4.5.4 Ready Control Logic


You can take one of two basic approaches to implement ready control logic, 
depending on the state of the ready input between accesses. If RDY is low be- 
tween accesses, the processor is always ready unless a wait state is required; 
if RDY is high between accesses, the processor will always enter a wait state 
unless a ready indication is generated. 


If RDY is low between accesses, control of devices that are zero-wait-state
at full speed is straightforward; no action is necessary, because ready is
always active unless otherwise required. Devices requiring wait states,
however, must drive ready high fast enough to meet the input timing
requirements. Then, after an appropriate delay, a ready indication must be
generated. This can be difficult in many circumstances because wait-state
devices are inherently slow and often require complex select decoding.


If RDY is high between accesses, zero-wait-state devices, which tend to be
inherently fast, can usually respond immediately with a ready indication.
Wait-state devices can simply delay their select signals appropriately to
generate a ready. Typically, this approach results in the most efficient
implementation of ready control logic. Figure 4-7 shows a circuit of this
type, which can be used to generate 0, 1, or 2 wait states for multiple
devices in a system.


Figure 4-7. Logic for Generation of 0, 1, or 2 Wait States for Multiple Devices


4.5.5 Example Circuit


Figure 4—7 shows how a single, 7-ns 16R4 programmable logic device (PLD) 
can be used to generate 0, 1, and 2 wait states for multiple devices that are 
interfaced to a 'C4x. In this example, distinct address bits are used to select 
the different wait-state devices. Here, each of the three address lines input to 
the 16R4 corresponds to a different speed device. For a single 16R4 imple- 
mentation, up to nine different address bits can be used to select different 
speed devices. 


The single output, 4Q, of the PLD is connected directly to the RDY0 input of
the ’C4x to signal the completion of a bus access for external wait-state
generation. Because RDY0 is sampled on the falling edge of H1, the H3 output
clock is used as the PLD clock input.


Example 4-1 shows the ready logic equations for programming the 16R4
PLD. The PLD language used is ABEL. STRB0 is an input into the PLD that
indicates that a valid ’C4x bus cycle is occurring. Also, a delayed version of
STRB0 (synchronized with H1 going high) is provided as the strb_syn_ input
signal. This delayed signal is needed to avoid problems with a race condition
that may exist between STRB0 going low and H1 rising. RESET can be used
to bring the state machine back to the idle state.


Notice that the RDY0 output of the PLD is not registered. An asynchronous
RDY0 signal is necessary to generate a ready signal for zero-wait-state
devices. When a zero-wait-state device is selected (ahi1 high in Example 4-1)
and STRB0 is low, the PLD asserts RDY0 low within 7 ns. Hence, RDY0 goes
active fast enough to satisfy the 20-ns setup time of RDY0 low before H1 low.


For generation of RDY0 for one and two wait states, the device select address
bits and strb_syn_ are delayed one and two cycles, respectively, by the PLD
before RDY0 is brought active low. The one H3-cycle delay, required for
one-wait-state device ready generation, corresponds to state wait_one in
Example 4-1, and the two H3-cycle delay required for two-wait-state devices
corresponds to states wait_twoa and wait_twob.
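
To make the state sequencing easier to follow, the behavior described above
can be modeled in C. This is a simulation sketch rather than the PLD
implementation itself: the type and function names are hypothetical, signal
levels are represented as booleans (true means electrically high), and each
call to rdy_next_state stands for one active edge of the H3 clock. The
transitions and the output equation follow the ABEL source in Example 4-1.

```c
#include <stdbool.h>

/* States mirror the ABEL state names in Example 4-1. */
typedef enum { IDLE, WAIT_ONE, WAIT_TWOA, WAIT_TWOB } rdy_state;

/* PLD inputs; active-low signals carry an _n suffix (true = high). */
typedef struct {
    bool ahi1, ahi2, ahi3;  /* select 0-, 1-, and 2-wait-state devices */
    bool strb0_n;           /* low during a valid bus cycle */
    bool strb_syn_n;        /* strb0_ synchronized with H1 rising */
    bool reset_n;           /* low while reset is asserted */
} rdy_inputs;

/* Next state after one H3 clock edge. */
rdy_state rdy_next_state(rdy_state s, rdy_inputs in)
{
    switch (s) {
    case IDLE:
        if (in.reset_n && in.ahi2 && !in.strb_syn_n) return WAIT_ONE;
        if (in.reset_n && in.ahi3 && !in.strb_syn_n) return WAIT_TWOA;
        return IDLE;
    case WAIT_TWOA:
        return in.reset_n ? WAIT_TWOB : IDLE;
    case WAIT_ONE:
    case WAIT_TWOB:
    default:
        return IDLE;
    }
}

/* Combinational RDY0 level: false = low = ready asserted. */
bool rdy0_n(rdy_state s, rdy_inputs in)
{
    bool ready = in.reset_n &&
                 ((in.ahi1 && !in.strb0_n) ||
                  s == WAIT_ONE || s == WAIT_TWOB);
    return !ready;
}
```

Stepping the model with ahi3 high and strb_syn_ low moves it from idle to
wait_twoa (RDY0 high) and then to wait_twob (RDY0 low), matching the
corresponding test vectors in Example 4-1.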


This 16R4 PLD-based design can be used to implement different numbers of
wait states for multiple devices. More devices can be selected with ’C4x
address lines, and a higher number of wait states can be produced with
additional PLD logic. Furthermore, this approach can be used in conjunction
with the ’C4x’s internal wait-state generator.


Example 4-1. PLD Equations for Ready Generation


module ready_generation
title 'ready generation logic for 0, 1 and 2 wait state devices interfaced
       to TMS320C4x'

        C40u5   device 'P16R4';

"inputs
        h3          Pin 1;

"The following are TMS320C40 address bits used to
"select the different speed devices. More can be used if
"necessary. In this example, a zero wait state, a one wait
"state, and a two wait state device are decoded with these
"three address bits

        ahi1        Pin 2;   "when high selects zero wait state device
        ahi2        Pin 3;   "when high selects one wait state device
        ahi3        Pin 4;   "when high selects two wait state device
        strb0_      Pin 5;   "indicates valid TMS320C40 bus cycle
        reset_      Pin 6;   "reset signal from TMS320C40
        strb_syn_   Pin 7;   "strb0_ synchronized with H1 rising edge

"output
        rdy0_       Pin 12;  "ready signal to TMS320C40

        one_wait    Pin 14;  "internal flip-flop signal for 1 wait state
                             "device ready signal generation
        two_waita   Pin 15;  "internal flip-flop signal for first of the two
                             "wait states for 2 wait state devices
        two_waitb   Pin 16;  "internal flip-flop signal for second
                             "of the two wait states for 2 wait
                             "state devices

"name substitutions for test vectors
        c,H,L,X = .C.,1,0,.X.;

"state bits
        outstate  = [one_wait, two_waita, two_waitb];

        idle      = ^b111;
        wait_one  = ^b011;
        wait_twoa = ^b101;
        wait_twob = ^b110;

state_diagram outstate

state idle:
        if (reset_ & ahi2 & !strb_syn_) then wait_one
        else if (reset_ & ahi3 & !strb_syn_) then wait_twoa
        else idle;

state wait_one:
        GOTO idle;

state wait_twoa:
        if (reset_) then wait_twob
        else idle;

state wait_twob:
        GOTO idle;

equations
        !rdy0_ = reset_ & ((ahi1 & !strb0_) # !one_wait # !two_waitb);

@page
"Test 1st level global arbitration logic
test_vectors
        ([h3, ahi1, ahi2, ahi3, strb0_, strb_syn_, reset_] -> [outstate, rdy0_])
        [c, X, X, X, X, X, L] -> [idle,      H];
        [c, L, H, L, L, L, H] -> [wait_one,  L];
        [c, X, X, X, X, X, L] -> [idle,      H];
        [c, L, L, H, L, L, H] -> [wait_twoa, H];
        [c, X, X, X, X, X, L] -> [idle,      H];
        [c, L, L, H, L, L, H] -> [wait_twoa, H];
        [c, L, L, H, L, L, H] -> [wait_twob, L];
        [c, X, X, X, X, X, L] -> [idle,      H];
        [L, H, L, L, L, L, H] -> [idle,      L];
        [c, H, L, L, L, L, H] -> [idle,      L];
        [L, L, L, L, L, L, H] -> [idle,      H];
        [c, L, H, L, L, L, H] -> [wait_one,  L];
        [c, X, X, X, X, X, H] -> [idle,      H];
        [c, L, L, H, L, L, H] -> [wait_twoa, H];
        [c, L, L, H, L, L, H] -> [wait_twob, L];
        [c, H, L, L, L, L, H] -> [idle,      L];
        [c, X, X, X, H, H, H] -> [idle,      H];
        [c, X, X, X, H, H, H] -> [idle,      H];
end ready_generation


4.5.6 Page Switching Techniques


The ’C4x’s programmable page-switching feature can greatly ease system
design when large amounts of memory or slow external peripheral devices are
required. This feature provides a time period for disabling all device selects.
During the interval, slow devices are allowed time to turn off before other
devices have the opportunity to drive the data bus, thus avoiding bus
contention.


When page switching is enabled, any time a portion of the high-order address
lines changes, as defined by the contents of the STRB0 and STRB1 PAGESIZE
fields (in the global and local memory interface control registers), the
corresponding STRB and PAGE go high for one full H1 cycle. Provided that STRB
is included in chip-select decodes, this causes all devices selected by that
STRB to be disabled during this period. The next page of devices is not
enabled until STRB and PAGE go low again.


If the high-order address lines remain constant during a read cycle, the
memory access time with page switching is the same as memory access time
without page switching. In addition, page switching is not required during
writes, because these write cycles exhibit an inherent one-half H1 cycle setup
of address information before STRB goes low. Thus, when you use page
switching for read/write devices, a minimum of half of one H1 cycle of address
setup is provided for all accesses outside a page boundary. Therefore, large
amounts of memory can be implemented without wait states or extra hardware
required for isolation between pages. Also, note that access time for cycles
during page switching is the same as that of cycles without page switching,
and, accordingly, full-speed accesses may still be accomplished within each
page.


The circuit shown in Figure 4-8 illustrates page switching with the CY7B185
15-ns 8K x 8 BiCMOS static RAM. This circuit implements 32K 32-bit words
of memory with full-speed zero wait-state accesses within each page.


Figure 4-8. Page Switching for the CY7B185

[Figure: circuit diagram; the only label recoverable from the source is
"Bank 0 (4 x CY7B185)".]


A 5-ns, 16L8 PLD decodes lines A15-A13. These lines along with STRB0
select each of the four pages in this circuit. With the STRB0 PAGESIZE field
of the global memory interface control register set to 0Ch, the pages are
selected on even 8K-word boundaries, starting at location zero in external
memory space.
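
The page decision for this configuration can be sketched in C. The sketch
assumes that a PAGESIZE value p selects pages of 2^(p+1) words, which is
inferred from the 0Ch setting giving 8K-word pages; the function names are
hypothetical, and the model covers only the extra H1 cycle that the text
describes for reads that cross a page boundary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page number of an address for a given PAGESIZE field value.
 * ASSUMPTION: page size is 2^(field + 1) words (0Ch -> 8K words). */
uint32_t page_of(uint32_t addr, unsigned pagesize_field)
{
    return addr >> (pagesize_field + 1u);
}

/* True when consecutive accesses fall in different pages. */
bool is_page_switch(uint32_t prev_addr, uint32_t next_addr,
                    unsigned pagesize_field)
{
    return page_of(prev_addr, pagesize_field) !=
           page_of(next_addr, pagesize_field);
}

/* Extra H1 cycles inserted before a read: one on a page switch,
 * none for reads that stay within the current page. */
unsigned extra_read_cycles(uint32_t prev_addr, uint32_t next_addr,
                           unsigned pagesize_field)
{
    return is_page_switch(prev_addr, next_addr, pagesize_field) ? 1u : 0u;
}
```

With PAGESIZE set to 0Ch, addresses 0000h through 1FFFh fall in page 0 and
2000h starts page 1, so a read of 2000h that follows a read of 1FFFh incurs
the extra cycle, while reads within 2000h through 3FFFh do not.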


This circuit cannot be implemented without page switching, because the data 
output’s turn-on and turn-off delays cause bus conflicts, and full-speed 
accesses do not allow enough time for chip-select decoding for the four pages. 
Here, the propagation delay of the 16L8 is involved only during page switches, 
where there is sufficient time between cycles to allow new chip-selects to be 
decoded. 


The timing of this circuit for read operations with page switching is shown in
Figure 4-9. When a page switch occurs, the page address on address lines
A30-A13 is updated during the extra H1 cycle while STRB0 is high. Then,
after chip-select decodes have stabilized and the previously selected page
has disabled its outputs, STRB0 goes low for the next read cycle. Further
accesses occur at full speed with the normal bus timings, as long as another
page switch is not necessary. Write cycles do not require page switching,
because of the inherent address setup provided in their timings.


This timing is summarized in Table 4-2.


Figure 4-9. Timing for Read Operations Using Bank Switching


Table 4-2. Page Switching Interface Timing

Time Interval    Event                                  Time Period
t1               H1 falling to address/STRB valid       7 ns
t2               STRB to select delay                   5 ns
t3               Memory disable from select             8 ns
t4               H1 falling to STRB                     7 ns
t5               STRB to select delay                   5 ns
t6               Memory output enable delay             3 ns


4.6 Parallel Processing Through Shared Memory


The ’C4x’s two memory interfaces allow flexibility to design shared-memory 
interfaces for parallel processing. Many processors can be linked together in 
a wide variety of network configurations through these ports. In this section, 
Figure 4—10 illustrates ’C4x shared-memory networks that you can use to fulfill 
many signal processing system needs. 


Figure 4-10. ’C4x Shared/Distributed-Memory Networks 


4.6.1 Shared Global-Memory Interface


One of the most common multiprocessor configurations is the sharing of 
memory by all processors in a system. Shared memory is typically 
implemented by tying the processors’ data and address lines together. Howev- 
er, the shared memory interface must guarantee that no more than one 
processor is driving the shared bus at any one time; it must also allow all 
processors sharing the bus to have a chance to access shared resources. 


The ’C4x supports shared memory multiprocessing with its identical global-
and local-port interfaces. Both interfaces have four status output signals,
(L)STAT3-0, which identify what type of access is beginning on the bus. These
signals identify whether the ’C4x port is idle, a DMA read is occurring, a
STRB1 write is occurring, a LOCKed access to memory is pending, etc. The
signals can be interpreted by the interface to issue single access or locked
access bus requests to a shared bus arbiter.


The (L)CE, (L)AE, and (L)DE input signals support shared control, address,
and data lines. When the signals are disabled (high), they put the port’s
control signals, address lines, and data lines, respectively, in the
high-impedance state. These bus enable lines are asynchronous inputs to the
’C4x, which can quickly turn off bus drivers when another processor is
accessing a shared resource. However, these signals asynchronously turn off
the ’C4x’s local and global buses, without memory accesses being suspended. To
ensure that data written is seen externally and data read is valid, you should
use the external (L)RDY signal for wait-state generation in shared memory
designs. An (L)RDY signal should not be sent to the ’C4x until the processor
has regained access to the bus (CE, AE, DE enabled) and has had enough time
to complete its access. Hence, with bus enable and status signals, the ’C4x
flexible bus interfaces easily implement high-speed shared bus
configurations.


4.6.2 Shared-Memory Interface Design Example


For an example of a ’C4x shared-memory interface, see the TMS320C4x Parallel
Processing Development System Technical Reference (SPRU075). In
the example in that text, four ’C4x devices share SRAM with their global buses
tied together. A bus arbitrator implemented as a programmable logic device
provides a fair scheme for processor access to the shared bus. The design
uses high-speed parts but employs a fully asynchronous handshake protocol
that allows ’C4x devices of various speeds and also processors other than
’C4x devices to be added to this bus configuration.

The shared-memory interface in the PPDS works for ’C4x devices running at
a speed of up to 32 MHz. For higher speeds, the arbitrator incorrectly takes
away bus master privileges from a ’C4x between back-to-back reads to the
same page. (The page size is determined by the page size field in the global
bus control register; the default page size for the PPDS global memory is
64K.) If this occurs while two or more ’C4x devices are requesting the bus to
perform write cycles, random shared memory locations can be corrupted.


To fix this problem for higher speeds, the busenable_ signal of each ’C4x local
interface can be used to generate gmce0_ and gmce1_ to prevent these signals
from going low (active) if all the processors’ busenable_ signals are high
(inactive). The busenable_ signal is shown in the PLD equations in the Global
Bus Interface Logic section of the TMS320C4x Parallel Processing Development
System Technical Reference. The gmce0_ and gmce1_ signals are
shown in the Global Memory Control section of the same book.


Programming Tips


Programming style is highly personal and reflects each individual’s prefer- 
ences and experiences. The purpose of this chapter is not to impose any par- 
ticular style. Instead, it emphasizes some of the features of the 'C4x that can 
help in producing faster and/or shorter programs. The tips in this chapter cover 
both C and assembly language programming. 


Topic
5.1 Hints for Optimizing C Code
5.2 Hints for Optimizing Assembly-Language Code

5.1 Hints for Optimizing C Code


The ’C4x’s large register file, software stack, and large memory space easily 
support the ’C4x C Compiler. The C compiler translates standard ANSI C pro- 
grams into assembly language source. It also increases the portability and de- 
creases the porting time of applications. 


The suggested methodology for developing your application follows five steps:

1) Write the application in C.
2) Debug the program.
3) Estimate whether the program runs in real time.
4) If the program does not run in real time:

   - Use the -o2 or -o3 option when compiling.
   - Use registers to pass parameters (-mr compiling option).
   - Use inlining (-x compiling option).
   - Remove the -g option when compiling.
   - Follow some of the efficient code generation tips listed below.

5) Identify places where most of the execution time is spent and optimize
   these areas by writing assembly language routines that implement the
   functions.


The efficiency of the code generated by the floating-point compiler depends
to a large extent on how well you take advantage of the compiler strengths
described above when writing your C code. There are specific constructs that
can vastly improve the compiler's effectiveness:

- Use register variables for often-used variables. This is particularly true
  for pointer variables. Example 5-1 shows a code fragment that exchanges
  one object in memory with another.

Example 5-1. Exchanging Objects in Memory

    do
    {
        temp = *++src;
        *src = *++dest;
        *dest = temp;
    }
    while (--n);


- Pre-compute subexpressions, especially array references in loops. Assign
  commonly used expressions to register variables where possible.

- Use *++ to step through arrays, rather than using an index to recalculate
  the address each time through a loop.

As an example of the previous two points, consider the loops in Example 5-2:

Example 5-2. Optimizing a Loop

    /* loop 1 */
    main()
    {
        float a[10], b[10];
        int i;
        for (i = 0; i < 10; ++i)
            a[i] = (a[i] * 20) + b[i];
    }

    /* loop 2 */
    main()
    {
        float a[10], b[10];
        int i;
        register float *p = a, *q = b;
        /* p is advanced in the loop header so that it is not read and
           modified within the same expression */
        for (i = 0; i < 10; ++i, ++p)
            *p = (*p * 20) + *q++;
    }


Loop 1 executes in 19 cycles. Loop 2, which is the equivalent of loop 1, 
executes in 12 cycles. 


- Use structure assignments to copy blocks of data. The compiler generates
  very efficient code for structure assignments, so nest objects within
  structures and use simple assignments to copy them.

- Avoid large local frames and declare the most often used local variables
  first. The compiler uses indirect addressing with an 8-bit offset to
  access local data. To access objects on the local frame with offsets greater
  than 255, the compiler must first load the offset into an index register.
  This costs 1 extra instruction and incurs 2 cycles of pipeline delay.

- Avoid the large model. The large model is inefficient because the compiler
  reloads the data-page pointer (DP) before each access to a global or
  static variable. If you have large array objects, use malloc() to
  dynamically allocate them and access them via pointers rather than
  declaring them globally. Example 5-3 illustrates two methods for
  allocating large array objects.

Example 5-3. Allocating Large Array Objects

    /* Bad Method */
    int a[100000];                                   /* BAD */

    a[i] = 10;

    /* Good Method */
    int *a = (int *)malloc(100000 * sizeof(int));    /* GOOD */
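
The structure-assignment tip above can be sketched as follows. This is only an illustration: the type name block_t, the length BLOCK_LEN, and the function copy_block are hypothetical and do not come from the compiler documentation.

```c
/* Hypothetical block type: nesting the array inside a structure lets a
   single assignment copy the whole block. */
#define BLOCK_LEN 8

typedef struct {
    float data[BLOCK_LEN];
} block_t;

/* Copy one block to another with a plain structure assignment instead of
   an explicit element-by-element loop. */
void copy_block(block_t *dst, const block_t *src)
{
    *dst = *src;
}
```

The caller passes pointers to two block_t objects; the compiler is free to choose the block-move sequence it considers best for the assignment.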

5.2 Hints for Optimizing Assembly-Language Code


Each program has particular requirements. Not all possible optimizations 
make sense in every case. The suggestions presented in this section can be 
used as a checklist of available software tools. 


- Use delayed branches. Delayed branches execute in a single cycle; regular
  branches execute in four. The three instructions that follow the delayed
  branch are executed whether the branch is taken or not. If fewer than three
  useful instructions are available, use the delayed branch anyway and pad
  with NOPs; machine cycles (time) are still saved.

- Use delayed subroutine call and return. Regular subroutine CALL and RETS
  execute in four cycles. You can implement a delayed subroutine call with
  the link and jump (LAJ) instruction, and a delayed return with a delayed
  branch in R11 register mode (BUD R11). Both LAJ and BUD execute in a single
  cycle. The guidelines for using the LAJ instruction are the same as for
  delayed branches.


- Use the repeat single/block construct. This method produces loops with no
  overhead. Nesting such constructs will not normally increase efficiency,
  so try to use the feature on the most often performed loop. RPTBD is a
  single-cycle instruction, and RPTS and RPTB are four-cycle instructions.
  RPTBD and delayed branches are used in similar ways. Note that RPTS is not
  interruptible, and the executed instruction is not refetched for
  execution. This frees the buses for operands.

- Use parallel instructions. You can have a multiply in parallel with an add
  (or subtract) and stores in parallel with any multiply or ALU operation.
  This increases the number of operations executed in a single cycle. For
  maximum efficiency, observe the addressing modes used in parallel
  instructions and arrange the data appropriately. You can have loads in
  parallel with any multiply or add (or subtract). The result of a multiply
  by one or an add of zero is the same as a load. Therefore, to implement
  parallel instructions with a data load, you can substitute a multiply or
  an add instruction, with one extra register containing a one or zero, in
  place of the load instruction.


- Maximize the use of registers. The registers are an efficient way to
  access scratch-pad memory. Extensive use of the register file facilitates
  the use of parallel instructions and helps avoid pipeline conflicts when
  you use register addressing.

- Use the cache. The cache speeds instruction fetches and enables
  single-cycle access, even with slow external memory. The cache is
  transparent to the user, so make sure that it is enabled.

- Use internal memory instead of external memory. The internal memory
  (2K x 32 bits RAM and 4K x 32 bits ROM) is considerably faster to access
  than external memory. In a single cycle, two operands can be brought from
  internal memory. You can maximize performance if you use the DMA
  coprocessor in parallel with the CPU to transfer the data you want to
  operate on to internal memory.


- Avoid pipeline conflicts. For time-critical operations, make sure that
  cycles are not missed because of pipeline conflicts. If there is no problem
  with program speed, ignore this suggestion.

- Plan your linker command file in advance. Memory allocation for code and
  data sections can have a big impact on your algorithm performance. One of
  the ’C4x's strengths is its sustained bandwidth, achieved by having two
  external buses. By carefully dividing data and program between the two
  buses, you can minimize pipeline conflicts. You need to apply the same
  concept to minimize DMA/CPU access conflicts.


The above checklist is not exhaustive, and it does not address some features 
in detail. To learn how to exploit the full power of the ’C4x, carefully study its 
architecture, hardware configuration, and instruction set, which are all de- 
scribed in the TMS320C4x User’s Guide (SPRU063). 

Applications-Oriented Operations


The ’C4x architecture and instruction set features facilitate the solution of nu- 
merically intensive problems. This chapter presents examples of applications 
that use these features, such as companding, filtering, matrix arithmetic, and 
fast Fourier transforms (FFT). 
6.1 Companding

In telecommunications, one of the primary concerns is to conserve channel
bandwidth while preserving high speech quality. This is achieved by
quantizing the speech samples logarithmically. It has been demonstrated that
an 8-bit logarithmic quantizer produces speech quality equivalent to that of
a 13-bit uniform quantizer. The logarithmic quantization is achieved by
companding (COMpress/exPANDing). Two international standards have been
established for companding: the μ-law (used in the United States and Japan)
and the A-law (used in Europe). Detailed descriptions of μ-law and A-law
companding are presented in an application report on companding routines
included in the book Digital Signal Processing Applications with the TMS320
Family (literature number SPRA012A).


During transmission, logarithmically compressed data in sign-magnitude form
is sent along the communications channel. If any processing is necessary,
this data should be expanded to a 14-bit (for μ-law) or 13-bit (for A-law)
linear format. This operation occurs when data is received at the digital
signal processor. After processing, and in order to continue transmission,
the result is compressed back to 8-bit format and transmitted through the
channel.


Example 6-1 and Example 6-2 show μ-law compression and expansion (that is,
linear to μ-law and μ-law to linear conversion), while Example 6-3 and
Example 6-4 show A-law compression and expansion. For expansion, using a
look-up table is an alternative approach. It trades memory space for speed of
execution. Because the compressed data is 8 bits long, a table with 256
entries can be constructed to contain the expanded data. If the compressed
data is stored in register AR0, the following two instructions put the
expanded data in register R0:

        ADDI   @TABL,AR0     ;@TABL = BASE ADDRESS OF TABLE
        LDI    *AR0,R0       ;PUT EXPANDED NUMBER IN R0
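
The table-based expansion can be modeled in C as shown below. This is a minimal sketch, assuming the μ-law bit layout used by the routines in this section (all bits inverted for transmission, sign in bit 7, segment in bits 6-4, quantization bin in bits 3-0); the names mulaw_expand, expand_table, build_expand_table, and expand_lookup are hypothetical.

```c
/* Expand one mu-law code word to a linear value: complement the bits,
   isolate the fields, add the bias, shift by the segment number,
   subtract the bias, and apply the sign. */
int mulaw_expand(unsigned char code)
{
    unsigned int bits = (unsigned char)~code;   /* complement bits        */
    int bin = (int)(bits & 0x0F);               /* quantization bin       */
    int seg = (int)((bits >> 4) & 0x07);        /* segment code           */
    int mag = (((bin << 1) + 33) << seg) - 33;  /* 1xxxx1 pattern, debias */
    return (bits & 0x80) ? -mag : mag;
}

int expand_table[256];   /* hypothetical 256-entry look-up table */

/* Fill the table once, for example at start-up. */
void build_expand_table(void)
{
    int c;
    for (c = 0; c < 256; c++)
        expand_table[c] = mulaw_expand((unsigned char)c);
}

/* Expansion is then a single indexed read, like the two-instruction
   sequence above. */
int expand_lookup(unsigned char code)
{
    return expand_table[code];
}
```

The table costs 256 words of memory, and each expansion is reduced to one memory access.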


The same look-up table approach could be used for compression, but the
required table length would then be 16,384 words for μ-law or 8,192 words for
A-law. If this memory size is not acceptable, you should use the subroutines
presented in Example 6-1 or Example 6-3.

Example 6-1. μ-Law Compression

*
*  TITLE U-LAW COMPRESSION
*
*  SUBROUTINE MUCMPR
*
*  TYPICAL CALLING SEQUENCE:
*       LAJU   MUCMPR
*       LDI    v,R0
*       NOP              <--- can be other non-pipeline-break
*       NOP              <--- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 14 (not including the BUD instruction)
*              WORDS:  15 (not including the BUD instruction)
*
        .global MUCMPR
*
MUCMPR  LSH3   -6,R0,R1      ;Save sign of number
        ABSI   R0,R0
        CMPI   1FDEH,R0      ;If R0>0x1FDE,
        LDIGT  1FDEH,R0      ;saturate the result
        ADDI   33,R0         ;Add bias
        FLOAT  R0            ;Normalize: (seg+5)0WXYZx...x
        MPYF   0.03125,R0    ;Adjust segment number by 2**(-5)
        LSH    1,R0          ;(seg)WXYZx...x
        PUSHF  R0
        POP    R0            ;Treat number as integer
        LSH    -20,R0        ;Right-justify
        BUD    R11           ;Delayed return
        AND    080H,R1       ;Set sign bit
        ADDI   R1,R0         ;R0 = compressed number
        NOT    R0            ;Reverse all bits for transmission
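
For readers who want to check the compression steps on a host machine, the routine can be modeled in C as sketched below. This is an illustrative model only, assuming a 14-bit two's-complement input; the function name mulaw_compress is hypothetical, and the segment search is done with a loop rather than the floating-point normalization used in the assembly code.

```c
/* Model of the mu-law compression steps: save the sign, take the
   magnitude, saturate, add the bias, locate the segment from the
   position of the leading 1, and keep the four bits that follow it. */
unsigned char mulaw_compress(int v)
{
    int sign = (v < 0) ? 0x80 : 0x00;   /* sign bit of the code word     */
    int mag = (v < 0) ? -v : v;
    int seg = 0;
    int bin;

    if (mag > 0x1FDE)                   /* saturate the result           */
        mag = 0x1FDE;
    mag += 33;                          /* add bias: 33 <= mag <= 8191   */

    while ((mag >> (seg + 6)) != 0)     /* leading 1 is at bit (seg + 5) */
        seg++;
    bin = (mag >> (seg + 1)) & 0x0F;    /* four bits after the leading 1 */

    /* Assemble the code word and reverse all bits for transmission. */
    return (unsigned char)~(sign | (seg << 4) | bin);
}
```

A zero input maps to the code word 0xFF, and any input above the saturation level maps to 0x80.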
Example 6-2. μ-Law Expansion

*
*  TITLE U-LAW EXPANSION
*
*  SUBROUTINE MUXPND
*
*  TYPICAL CALLING SEQUENCE:
*       LAJU   MUXPND
*       LDI    v,R0
*       NOP              <--- can be other non-pipeline-break
*       NOP              <--- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1, R2
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 11/10 (worst/best, not including subroutine overhead)
*              WORDS:  11 (not including subroutine overhead)
*
        .global MUXPND
*
MUXPND  NOT    R0,R0         ;Complement bits
        AND3   0FH,R0,R1     ;Isolate quantization bin
        LSH    1,R1
        ADDI   33,R1         ;Add bias to introduce 1xxxx1
        LSH3   -4,R0         ;Isolate segment code
        TSTB   08H,R0        ;Test sign
        BZD    R11           ;If positive, delayed return
        AND    7,R0
        LSH3   R0,R1,R0      ;Shift and put result in R0
        SUBI   33,R0         ;Subtract bias
        BUD    R11           ;Delayed return
        NEGI   R0            ;Negate if a negative number
        NOP
        NOP
Example 6-3. A-Law Compression

*
*  TITLE A-LAW COMPRESSION
*
*  SUBROUTINE ACMPR
*
*  TYPICAL CALLING SEQUENCE:
*       LAJ    ACMPR
*       LDI    v,R0
*       NOP              <--- can be other non-pipeline-break
*       NOP              <--- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 16/10 (worst/best, not including subroutine overhead)
*              WORDS:  16 (not including subroutine overhead)
*
        .global ACMPR
*
ACMPR   LSH3   -5,R0,R1      ;Save sign of number
        ABSI   R0,R0
        CMPI   1FH,R0        ;If R0<0x20,
        BLE    END           ;do linear coding
        CMPI   0FFFH,R0      ;If R0>0xFFF,
        LDIGT  0FFFH,R0      ;saturate the result
        LSH    -1,R0         ;Eliminate rightmost bit
        FLOAT  R0            ;Normalize: (seg+3)0WXYZx...x
        MPYF   0.125,R0      ;Adjust segment number by 2**(-3)
        LSH    1,R0          ;(seg)WXYZx...x
        PUSHF  R0
        POP    R0            ;Treat number as integer
        LSH    -20,R0        ;Right-justify
END     BUD    R11           ;Delayed return
        AND    080H,R1       ;Set sign bit
        ADDI   R1,R0         ;R0 = compressed number
        XOR    0D5H,R0       ;Invert even bits for transmission
Example 6-4. A-Law Expansion

*
*  TITLE A-LAW EXPANSION
*
*  SUBROUTINE AXPND
*
*  TYPICAL CALLING SEQUENCE:
*       LAJU   AXPND
*       LDI    v,R0
*       NOP              <--- can be other non-pipeline-break
*       NOP              <--- instructions
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  R0       | v = NUMBER TO BE CONVERTED
*
*  REGISTERS USED AS INPUT: R0
*  REGISTERS MODIFIED: R0, R1, R2
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 15/13 (worst/best, not including subroutine overhead)
*              WORDS:  15 (not including subroutine overhead)
*
        .global AXPND
*
AXPND   XOR    0D5H,R0,R2    ;Invert even bits
        ASH3   -4,R2,R0      ;Store for sign bit
        AND    7,R0          ;Isolate segment code
        BZD    SKIP1
        AND3   0FH,R2,R1     ;Isolate quantization bin
        LSH    1,R1
        ADDI   1,R1          ;Create 0xxxx1
        ADDI   32,R1         ;or 1xxxx1
        SUBI   1,R0
SKIP1   LSH3   R0,R1,R0      ;Shift and put result in R0
        TSTB   80H,R2        ;Test sign bit
        BZAT   R11           ;If positive, delayed return and
                             ;annul next three instructions
        NEGI   R0            ;Negate if a negative number
        NOP
        NOP
        BU     R11           ;Return
6.2 FIR, IIR, and Adaptive Filters

Digital filters are a common requirement for digital signal processing systems.
There are two types of digital filters: finite impulse response (FIR) and infinite
impulse response (IIR). Each of these types can have either fixed or adaptable
coefficients. In this section, the fixed-coefficient filters are presented first, and
then the adaptive filters are discussed.

6.2.1 FIR Filters

If the FIR filter has an impulse response h[0], h[1],..., h[N-1], and x[n]
represents the input of the filter at time n, the output y[n] at time n is
given by this equation:

y[n] = h[0] x[n] + h[1] x[n-1] + ... + h[N-1] x[n-(N-1)]


Two features of the ’C4x that facilitate the implementation of the FIR filters are 
parallel multiply/add operations and circular addressing. The first permits the 
performance of a multiplication and an addition in a single machine cycle, while 
the second makes a finite buffer of length N sufficient for the data x. 


Figure 6—1 shows the arrangement of the memory locations to implement cir- 
cular addressing, while Example 6-5 presents the ’C4x assembly code for an 
FIR filter. 


Figure 6-1. Data Memory Organization for an FIR Filter

                impulse                     initial          final
                response                    input samples    input samples
low address     h(N-1)      oldest input    x[n-(N-1)]       x[n]
                h(N-2)                      x[n-(N-2)]       x[n-(N-1)]
                  .                           .                .
                  .                           .                .           circular
                  .                           .                .           queue
                h(1)                        x[n-1]           x[n-2]
high address    h(0)        newest input    x[n]             x[n-1]


To set up circular addressing, initialize the block-size register BK to block
length N. Also, the locations for signal x should start from a memory location
whose address is a multiple of the smallest power of 2 that is greater than N.
For instance, if N = 24, the first address for x should be a multiple of 32 (the
lower 5 bits of the beginning address should be zero). For more information,
see Circular Addressing in the TMS320C4x User’s Guide.


In Example 6-5, the pointer to the input sequence x is incremented and as- 
sumed to be moving from an older input to a newer input. At the end of the sub- 
routine, AR1 will point to the position for the next input sample. 
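
The combination of a multiply-accumulate loop and a length-N circular buffer can be sketched in C as follows. This is a host-side illustration under stated assumptions, not the ’C4x routine itself: the function name fir_step and the index variable pos are hypothetical, and the modulo operation stands in for the hardware circular addressing controlled by BK.

```c
#include <stddef.h>

/* One FIR output sample using a circular buffer of length n.
   h[i] holds h(i); x[] is the circular queue of the last n inputs;
   *pos indexes the slot holding the oldest sample, which is where the
   newest input is written. */
float fir_step(const float *h, float *x, size_t n, size_t *pos, float in)
{
    size_t k, i;
    float acc = 0.0f;

    x[*pos] = in;                 /* newest input replaces the oldest    */
    k = (*pos + 1) % n;           /* k now points to x[n-(N-1)]          */

    /* Walk from the oldest to the newest sample, pairing the oldest
       sample with h(N-1) and the newest with h(0). */
    for (i = 0; i < n; i++) {
        acc += h[n - 1 - i] * x[k];
        k = (k + 1) % n;
    }

    *pos = (*pos + 1) % n;        /* position for the next input sample  */
    return acc;
}
```

Feeding a unit impulse through the filter returns the coefficients h(0), h(1), ..., h(N-1) in order, which is a convenient check of the buffer indexing.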
Example 6-5. FIR Filter

*
*  TITLE FIR FILTER
*
*  SUBROUTINE FIR
*
*  EQUATION: y(n) = h(0) * x(n) + h(1) * x(n-1) +
*                   ... + h(N-1) * x(n-(N-1))
*
*  TYPICAL CALLING SEQUENCE:
*
*       LOAD   AR0
*       LAJU   FIR
*       LOAD   AR1
*       LOAD   RC
*       LOAD   BK
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------
*  AR0      | ADDRESS OF h(N-1)
*  AR1      | ADDRESS OF x(n-(N-1))
*  RC       | LENGTH OF FILTER - 2 (N-2)
*  BK       | LENGTH OF FILTER (N)
*
*  REGISTERS USED AS INPUT: AR0, AR1, RC, BK
*  REGISTERS MODIFIED: R0, R2, AR0, AR1, RC
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 3 + N (not including subroutine overhead)
*              WORDS:  6 (not including subroutine overhead)
*
        .global FIR
*
FIR     RPTBD  CONV                         ;Set up the repeat cycle
*  Initialize R0:
        MPYF3  *AR0++(1),*AR1++(1)%,R0      ;h(N-1) * x(n-(N-1)) -> R0
        LDF    0.0,R2                       ;Initialize R2
        NOP
*
*  FILTER (1 <= i < N)
*
CONV    MPYF3  *AR0++(1),*AR1++(1)%,R0      ;h(N-1-i) * x(n-(N-1-i)) -> R0
||      ADDF3  R0,R2,R2                     ;Multiply and add operation
*
        BUD    R11                          ;Delayed return
        ADDF   R0,R2,R0                     ;Add last product
        NOP
        NOP
*
*  end
*
        .end
6.2.2 IIR Filters

The transfer function of an IIR filter has both poles and zeros. Its output
depends on both the input and the past output. As a rule, IIR filters need
less computation than an FIR filter with a similar frequency response, but
they have the drawback of being sensitive to coefficient quantization. Most
often, IIR filters are implemented as a cascade of second-order sections
called biquads. Example 6-6 and Example 6-7 show the implementation for one
biquad and for any number of biquads, respectively.

The output of a single biquad is given by this equation:

y[n] = a1 y[n-1] + a2 y[n-2] + b0 x[n] + b1 x[n-1] + b2 x[n-2]

However, the following two equations are more convenient and have smaller
storage requirements:

d[n] = a2 d[n-2] + a1 d[n-1] + x[n]

y[n] = b2 d[n-2] + b1 d[n-1] + b0 d[n]

Figure 6-2 shows the memory organization for this two-equation approach to
the implementation of a single biquad on the ’C4x.

Figure 6-2. Data Memory Organization for a Single Biquad

As in the case of FIR filters, the address for the start of the values d must be
a multiple of 4; that is, the last two bits of the beginning address must be zero.
The block-size register BK must be initialized to 3.
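
The two-equation form of the biquad can be sketched in C as follows. This is a minimal host-side model, assuming the delay node values are kept in two plain variables rather than in the 3-word circular queue used on the ’C4x; the type biquad_t and the function biquad_step are hypothetical names.

```c
/* State and coefficients of one biquad. d1 and d2 hold the delay node
   values d(n-1) and d(n-2). */
typedef struct {
    float a1, a2;       /* feedback coefficients      */
    float b0, b1, b2;   /* feedforward coefficients   */
    float d1, d2;       /* d(n-1) and d(n-2)          */
} biquad_t;

/* Process one input sample:
     d(n) = a2*d(n-2) + a1*d(n-1) + x(n)
     y(n) = b2*d(n-2) + b1*d(n-1) + b0*d(n)          */
float biquad_step(biquad_t *s, float x)
{
    float d = s->a2 * s->d2 + s->a1 * s->d1 + x;
    float y = s->b2 * s->d2 + s->b1 * s->d1 + s->b0 * d;

    s->d2 = s->d1;      /* d(n-1) becomes d(n-2)      */
    s->d1 = d;          /* d(n) becomes d(n-1)        */
    return y;
}
```

Only two delay values are stored per biquad, compared with the four past values (two inputs and two outputs) needed by the single-equation form.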
Example 6-6. IIR Filter (One Biquad)

*
*  TITLE IIR FILTER
*
*  SUBROUTINE IIR1
*
*  IIR1 == IIR FILTER (ONE BIQUAD)
*
*  EQUATIONS: d(n) = a2 * d(n-2) + a1 * d(n-1) + x(n)
*             y(n) = b2 * d(n-2) + b1 * d(n-1) + b0 * d(n)
*
*  OR         y(n) = a1*y(n-1) + a2*y(n-2) + b0*x(n) + b1*x(n-1)
*                    + b2*x(n-2)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R2
*       LAJU   IIR1
*       load   AR0
*       load   AR1
*       load   BK
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------------------
*  R2       | INPUT SAMPLE X(N)
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (A2)
*  AR1      | ADDRESS OF DELAY NODE VALUES (D(N-2))
*  BK       | BK = 3
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, BK
*  REGISTERS MODIFIED: R0, R1, R2, AR0, AR1
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 7 (not including subroutine overhead)
*              WORDS:  7 (not including subroutine overhead)
*
        .global IIR1
*
IIR1    MPYF3  *AR0,*AR1,R0                 ;a2 * d(n-2) -> R0
        MPYF3  *++AR0(1),*AR1--(1)%,R1      ;b2 * d(n-2) -> R1
*
        MPYF3  *++AR0(1),*AR1,R0            ;a1 * d(n-1) -> R0
||      ADDF3  R0,R2,R2                     ;a2*d(n-2)+x(n) -> R2
*
        MPYF3  *++AR0(1),*AR1--(1)%,R0      ;b1 * d(n-1) -> R0
||      ADDF3  R0,R2,R2                     ;a1*d(n-1)+a2*d(n-2)
*                                           ;+x(n) -> R2
*
        BUD    R11                          ;Delayed return
*
        MPYF3  *++AR0(1),R2,R2              ;b0 * d(n) -> R2
||      STF    R2,*AR1++(1)%                ;Store d(n) and point to d(n-1)
*
        ADDF   R0,R2                        ;b1*d(n-1)+b0*d(n) -> R2
        ADDF   R1,R2,R0                     ;b2*d(n-2)+b1*d(n-1)
*                                           ;+b0*d(n) -> R0
*
*  end
*
        .end
Generally, the IIR filter contains N > 1 biquads. The equations for its
implementation are given by the following pseudo-C language code:

y[-1,n] = x[n]
for (i = 0; i < N; i++) {
    d[i,n] = a2[i] d[i,n-2] + a1[i] d[i,n-1] + y[i-1,n]
    y[i,n] = b2[i] d[i,n-2] + b1[i] d[i,n-1] + b0[i] d[i,n]
}
y[n] = y[N-1,n]

Figure 6-3 shows the memory organization, and Example 6-7 shows the
corresponding ’C4x assembly-language code.

Figure 6-3. Data Memory Organization for N Biquads

[Figure not reproduced. It lists, from low address to high address, the
filter coefficients, the initial delay node values, and the final delay node
values; the delay node values of each biquad form a circular queue.]

The block-size register BK should be initialized to 3, and each set of d values
(that is, d[i,n], i = 0...N-1) should begin at an address that is a multiple
of 4 (the last two bits zero), as stated in the case of a single biquad.
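
The cascade described by the pseudo-C code can be written as compilable C along these lines. This is a sketch under stated assumptions: each stage keeps its own pair of delay values in a structure instead of a circular queue, and the names biquad_t and iir_cascade are hypothetical.

```c
#include <stddef.h>

/* Coefficients and delay node values of one stage in the cascade. */
typedef struct {
    float a1, a2, b0, b1, b2;
    float d1, d2;               /* d(i,n-1) and d(i,n-2) */
} biquad_t;

/* Run one input sample through nbiquads stages in series. The output of
   stage i-1 is the input of stage i; the input x feeds stage 0. */
float iir_cascade(biquad_t *s, size_t nbiquads, float x)
{
    float y = x;
    size_t i;

    for (i = 0; i < nbiquads; i++) {
        float d = s[i].a2 * s[i].d2 + s[i].a1 * s[i].d1 + y;
        y = s[i].b2 * s[i].d2 + s[i].b1 * s[i].d1 + s[i].b0 * d;
        s[i].d2 = s[i].d1;
        s[i].d1 = d;
    }
    return y;                   /* output of the last stage */
}
```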
Example 6-7. IIR Filter (N > 1 Biquads)

*
*  TITLE IIR FILTER (N > 1 BIQUADS)
*
*  SUBROUTINE IIR2
*
*  EQUATIONS: y(-1,n) = x(n)
*
*             FOR (i = 0; i < N; i++)
*             {
*               d(i,n) = a2(i) * d(i,n-2) + a1(i) * d(i,n-1) + y(i-1,n)
*               y(i,n) = b2(i) * d(i,n-2) + b1(i) * d(i,n-1) + b0(i) * d(i,n)
*             }
*             y(n) = y(N-1,n)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R2
*       load   AR0
*       load   AR1
*       load   IR0
*       LAJU   IIR2
*       load   IR1
*       load   BK
*       load   RC
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+-------------------------------------------
*  R2       | INPUT SAMPLE x(n)
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (a2(0))
*  AR1      | ADDRESS OF DELAY NODE VALUES (d(0,n-2))
*  BK       | BK = 3
*  IR0      | IR0 = 4
*  IR1      | IR1 = 4*N-4
*  RC       | NUMBER OF BIQUADS (N) - 2
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, IR0, IR1, BK, RC
*  REGISTERS MODIFIED: R0, R1, R2, AR0, AR1, RC
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 2 + 6N (not including subroutine overhead)
*              WORDS:  15 (not including subroutine overhead)
*
        .global IIR2
*
IIR2    MPYF3  *AR0,*AR1,R0                 ;a2(0) * d(0,n-2) -> R0
        MPYF3  *++AR0(1),*AR1--(1)%,R1      ;b2(0) * d(0,n-2) -> R1
*
        RPTBD  LOOP                         ;Set loop for 1 <= i < N
*
        MPYF3  *++AR0(1),*AR1,R0            ;a1(0) * d(0,n-1) -> R0
||      ADDF3  R0,R2,R2                     ;First sum term of d(0,n)
*
        MPYF3  *++AR0(1),*AR1--(1)%,R0      ;b1(0) * d(0,n-1) -> R0
||      ADDF3  R0,R2,R2                     ;Second sum term of d(0,n)
*
        MPYF3  *++AR0(1),R2,R2              ;b0(0) * d(0,n) -> R2
||      STF    R2,*AR1--(1)%                ;Store d(0,n); point to d(0,n-2)
*
*  LOOP STARTS HERE
*
        MPYF3  *++AR0(1),*++AR1(IR0),R0     ;a2(i) * d(i,n-2) -> R0
||      ADDF3  R0,R2,R2                     ;First sum term of y(i-1,n)
*                                           ;Pipeline hit on previous
*                                           ;instruction
        MPYF3  *++AR0(1),*AR1--(1)%,R1      ;b2(i) * d(i,n-2) -> R1
||      ADDF3  R1,R2,R2                     ;Second sum term of y(i-1,n)
*
        MPYF3  *++AR0(1),*AR1,R0            ;a1(i) * d(i,n-1) -> R0
||      ADDF3  R0,R2,R2                     ;First sum term of d(i,n)
*
        MPYF3  *++AR0(1),*AR1--(1)%,R0      ;b1(i) * d(i,n-1) -> R0
||      ADDF3  R0,R2,R2                     ;Second sum term of d(i,n)
*
LOOP    MPYF3  *++AR0(1),R2,R2              ;b0(i) * d(i,n) -> R2
||      STF    R2,*AR1--(1)%                ;Store d(i,n); point to d(i,n-2)
*
*  FINAL SUMMATION
*
        ADDF   R0,R2                        ;First sum term of y(N-1,n)
        BUD    R11                          ;Delayed return
        ADDF3  R1,R2,R0                     ;Second sum term of y(N-1,n)
        NOP    *AR1--(IR1)                  ;Return to first biquad
        NOP    *AR1--(1)%                   ;Point to d(0,n-1)
*
*  end
*
        .end



6.2.3 Adaptive Filters (LMS Algorithm) 


In some applications in digital signal processing, a filter must be adapted over 
time to keep track of changing conditions. The book Theory and Design of 
Adaptive Filters by Treichler, Johnson, and Larimore (Wiley-Interscience, 
1987) presents the theory of adaptive filters. Although in theory, both FIR and 
IIR structures can be used as adaptive filters, the stability problems and the 
local optimum points that the IIR filters exhibit make them less attractive for 
such an application. Hence, until further research makes IIR filters a better 
choice, only the FIR filters are used in adaptive algorithms of practical applica- 
tions. 


In an adaptive FIR filter, the filtering equation takes this form:

y[n] = h[n,0] x[n] + h[n,1] x[n-1] + ... + h[n,N-1] x[n-(N-1)]

The filter coefficients are time-dependent. In a least-mean-squares (LMS)
algorithm, the coefficients are updated by an equation in this form:

h[n+1,i] = h[n,i] + β x[n-i],   i = 0, 1, ..., N-1

β is a constant for the computation. The updating of the filter coefficients
can be interleaved with the computation of the filter output so that it takes
3 cycles per filter tap to do both. The updated coefficients are written over
the old filter coefficients. Example 6-8 shows the implementation of an
adaptive FIR filter on the ’C4x. The memory organization and the positioning
of the data in memory should follow the same rules as the above FIR filter
with fixed coefficients.
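
The interleaving of filtering and coefficient update can be sketched in C as follows. This is an illustration only, assuming the input history is passed as a plain array with x[i] holding x[n-i] (no circular addressing) and that the scale factor β has already been computed by the caller; the names lms_step and beta are hypothetical.

```c
#include <stddef.h>

/* One LMS iteration over an N-tap filter.
   h[i]  holds h(n,i) on entry and h(n+1,i) on return.
   x[i]  holds the input sample x(n-i).
   beta  is the scale factor used in the coefficient update.
   Returns y(n), computed with the coefficients from before the update. */
float lms_step(float *h, const float *x, size_t n, float beta)
{
    float y = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        y += h[i] * x[i];       /* filter with the old coefficient     */
        h[i] += beta * x[i];    /* then overwrite it with the new one  */
    }
    return y;
}
```

Each tap is read once for the output and then overwritten in place, which mirrors the statement that the updated coefficients are written over the old ones.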




Example 6-8. Adaptive FIR Filter (LMS Algorithm)

*
*  TITLE ADAPTIVE FIR FILTER (LMS ALGORITHM)
*
*  SUBROUTINE LMS
*
*  LMS == LMS ADAPTIVE FILTER
*
*  EQUATIONS: y(n) = h(n,0)*x(n) + h(n,1)*x(n-1) + ...
*                    + h(n,N-1)*x(n-(N-1))
*
*             FOR (i = 0; i < N; i++) h(n+1,i) = h(n,i)
*                                     + tmuerr * x(n-i)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R4
*       load   AR0
*       LAJU   LMS
*       load   AR1
*       load   RC
*       load   BK
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+----------------------------
*  R4       | SCALE FACTOR (2 * mu * err)
*  AR0      | ADDRESS OF h(n,N-1)
*  AR1      | ADDRESS OF x(n-(N-1))
*  RC       | LENGTH OF FILTER - 2 (N-2)
*  BK       | LENGTH OF FILTER (N)
*
*  REGISTERS USED AS INPUT: R4, AR0, AR1, RC, BK
*  REGISTERS MODIFIED: R0, R1, R2, AR0, AR1, RC
*  REGISTER CONTAINING RESULT: R0
*
*  BENCHMARKS: CYCLES: 4 + 3N (not including subroutine overhead)
*
*  PROGRAM SIZE: 9 words (not including subroutine overhead)
*
        .global LMS
*
LMS     RPTBD  LOOP                         ;Setup the delayed repeat block
*  Initialize R0:
        MPYF3  *AR0,*AR1,R0                 ;h(n,N-1) * x(n-(N-1)) -> R0
||      SUBF3  R2,R2,R2                     ;Initialize R2
*  Initialize R1:
        MPYF3  *AR1++(1)%,R4,R1             ;x(n-(N-1)) * tmuerr -> R1
        ADDF3  *AR0++(1),R1,R1              ;h(n,N-1) + x(n-(N-1)) *
*                                           ;tmuerr -> R1
*
*  FILTER AND UPDATE (1 <= i < N)
*
*  Filter:
        MPYF3  *AR0--(1),*AR1,R0            ;h(n,N-1-i) * x(n-(N-1-i)) -> R0
||      ADDF3  R0,R2,R2                     ;Multiply and add operation
*  Update:
        MPYF3  *AR1++(1)%,R4,R1             ;x(n-(N-1-i)) * tmuerr -> R1
||      STF    R1,*AR0++(1)                 ;R1 -> h(n+1,N-1-(i-1))
*
LOOP    ADDF3  *AR0++(1),R1,R1              ;h(n,N-1-i) + x(n-(N-1-i))
*                                           ;*tmuerr -> R1
*
        BUD    R11                          ;Delayed return
        ADDF3  R0,R2,R0                     ;Add last product
        STF    R1,*-AR0(1)                  ;h(n,0) + x(n)*tmuerr ->
*                                           ;h(n+1,0)
        NOP
*
*  end
*
        .end
6.3 Lattice Filters

The lattice form is an alternative way of implementing digital filters; it has
applications in speech processing, spectral estimation, and other areas. In
this discussion, the notation and terminology from speech processing
applications are used.

If H(z) is the transfer function of a digital filter that has only poles,
A(z) = 1/H(z) is a filter having only zeros, and it is called the inverse
filter. The inverse lattice filter is shown in Figure 6-4. These equations
describe the filter in mathematical terms:

f(i,n) = f(i-1,n) + k(i) b(i-1,n-1)

b(i,n) = b(i-1,n-1) + k(i) f(i-1,n)

Initial conditions:
f(0,n) = b(0,n) = x(n)
Final conditions:
y(n) = f(p,n)

In the above equations, f(i,n) is the forward error, b(i,n) is the backward
error, k(i) is the i-th reflection coefficient, x(n) is the input, and y(n) is
the output signal. The order of the filter (that is, the number of stages) is
p. In the linear predictive coding (LPC) method of speech processing, the
inverse lattice filter is used during analysis, and the (forward) lattice
filter is used during speech synthesis.
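
The inverse lattice recursion can be sketched in C as follows. This is a host-side model under stated assumptions: the array k[] holds k(1)...k(p) at indices 0...p-1, the array b[] holds the stored backward terms b(0,n-1)...b(p-1,n-1), and the function name lattice_inverse is hypothetical.

```c
#include <stddef.h>

/* One output sample of a p-stage inverse lattice filter.
   k[i-1] holds k(i); b[i] holds b(i,n-1) on entry and b(i,n) on return. */
float lattice_inverse(const float *k, float *b, size_t p, float x)
{
    float f = x;            /* f(0,n) = x(n)                 */
    float b_cur = x;        /* b(0,n) = x(n)                 */
    size_t i;

    for (i = 1; i <= p; i++) {
        float b_del = b[i - 1];                 /* b(i-1,n-1)          */
        float f_new = f + k[i - 1] * b_del;     /* f(i,n)              */
        float b_new = b_del + k[i - 1] * f;     /* b(i,n)              */

        b[i - 1] = b_cur;   /* save b(i-1,n) for the next sample       */
        f = f_new;
        b_cur = b_new;
    }
    return f;               /* y(n) = f(p,n)                 */
}
```

With k(1) = 0.5 and k(2) = 0.25, a unit impulse produces the output sequence 1, 0.625, 0.25, which is the impulse response of the all-zero filter A(z) built from those reflection coefficients.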


Figure 6-4. Structure of the Inverse Lattice Filter

[Figure not reproduced: signal-flow diagram of the inverse lattice filter,
showing the forward terms f(i,n) and the backward terms b(i,n) for stages 1
through p.]

Figure 6-5 shows the data memory organization of the inverse lattice filter on
the ’C40.

Figure 6-5. Data Memory Organization for Inverse Lattice Filters

                reflection       backward
                coefficients     propagation terms
low address     k(1)             b(0,n-1)
                k(2)             b(1,n-1)
                  .                 .
                  .                 .
                  .                 .
high address    k(p)             b(p-1,n-1)


Example 6-9. Inverse Lattice Filter

*
*  TITLE INVERSE LATTICE FILTER
*
*  SUBROUTINE LATINV
*
*  LATINV == LATTICE FILTER (LPC INVERSE FILTER - ANALYSIS)
*
*  TYPICAL CALLING SEQUENCE:
*
*       load   R2
*       LAJU   LATINV
*       load   AR0
*       load   AR1
*       load   RC
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+---------------------------------------------------
*  R2       | f(0,n) = x(n)
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (k(1))
*  AR1      | ADDRESS OF BACKWARD PROPAGATION VALUES (b(0,n-1))
*  RC       | RC = p - 2
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, RC
*  REGISTERS MODIFIED: R0, R1, R2, R3, RS, RE, RC, AR0, AR1
*  REGISTER CONTAINING RESULT: R2 (f(p,n))
*
*  BENCHMARKS: CYCLES: 3 + 3p (not including subroutine overhead)
*  PROGRAM SIZE: 9 WORDS (not including subroutine overhead)
*
        .global LATINV
*
*  i = 1
*
LATINV  RPTBD  LOOP                         ;Setup the delayed repeat block loop
        MPYF3  *AR0,*AR1,R0                 ;k(1) * b(0,n-1) -> R0
*                                           ;Assume f(0,n) -> R2
        LDF    R2,R3                        ;Put b(0,n) = f(0,n) -> R3
        MPYF3  *AR0++(1),R2,R1              ;k(1) * f(0,n) -> R1
*
*  2 <= i <= p (Repeat block loop starts here)
*
        MPYF3  *AR0,*++AR1(1),R0            ;k(i) * b(i-1,n-1) -> R0
||      ADDF3  R2,R0,R2                     ;f(i-1-1,n)+k(i-1)*b(i-1-1,n-1)
*                                           ;= f(i-1,n) -> R2
*                                           ;b(i-1-1,n-1)+k(i-1)*f(i-1-1,n)
        ADDF3  *-AR1(1),R1,R3               ;= b(i-1,n) -> R3
||      STF    R3,*-AR1(1)                  ;b(i-1-1,n) -> b(i-1-1,n-1)
LOOP    MPYF3  *AR0++(1),R2,R1              ;k(i) * f(i-1,n) -> R1
*
*  i = p + 1 (CLEANUP)
*
        BUD    R11                          ;Delayed return
        ADDF3  R2,R0,R2                     ;f(p-1,n) + k(p)*b(p-1,n-1)
*                                           ;= f(p,n) -> R2
        ADDF3  *AR1,R1,R3                   ;b(p-1,n-1) + k(p)*f(p-1,n)
*                                           ;= b(p,n) -> R3
||      STF    R3,*AR1                      ;b(p-1,n) -> b(p-1,n-1)
        NOP
*
*  end
*
        .end


The structure of the forward lattice filter, shown in Figure 6-6, is similar
to that of the inverse filter. These corresponding equations describe the
lattice filter:

f(i-1,n) = f(i,n) - k(i) b(i-1,n-1)

b(i,n) = b(i-1,n-1) + k(i) f(i-1,n)

Initial conditions:
f(p,n) = x(n), b(i,n-1) = 0 for i = 1,...,p
Final conditions:
y(n) = f(0,n)

The data memory organization is identical to that of the inverse filter shown
in Figure 6-5. Example 6-10 shows the implementation of the lattice filter on
the ’C4x.
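
The forward (synthesis) lattice recursion can be sketched in C in the same way. This is a minimal model under stated assumptions: k[] holds k(1)...k(p) at indices 0...p-1, b[] holds b(0,n-1)...b(p-1,n-1), b(p,n) is not stored because the recursion never reads it, and the function name lattice_forward is hypothetical.

```c
#include <stddef.h>

/* One output sample of a p-stage forward lattice filter.
   k[i-1] holds k(i); b[i] holds b(i,n-1) on entry and b(i,n) on return. */
float lattice_forward(const float *k, float *b, size_t p, float x)
{
    float f = x;                        /* f(p,n) = x(n)                  */
    size_t i;

    for (i = p; i >= 1; i--) {
        f -= k[i - 1] * b[i - 1];       /* f(i-1,n)                       */
        if (i < p)                      /* b(i,n) replaces b(i,n-1)       */
            b[i] = b[i - 1] + k[i - 1] * f;
    }
    b[0] = f;                           /* b(0,n) = f(0,n)                */
    return f;                           /* y(n) = f(0,n)                  */
}
```

The stages are visited from p down to 1, so each stored b(i-1,n-1) is read before the slot above it is overwritten with the new backward term.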


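Figure 6-6. Structure of the Forward Lattice Filter

[Figure not reproduced: signal-flow diagram of the forward lattice filter. The
input x(n) = f(p,n) passes through stages p down to 1, producing the forward
terms f(2,n) and f(1,n) along the way; the backward terms b(1,n), b(2,n), ...,
b(p,n) are formed with the coefficients K1, K2, ..., Kp.]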
Example 6-10. Lattice Filter

*
*  TITLE LATTICE FILTER
*
*  SUBROUTINE LATTICE
*
*       LAJU   LATTICE
*       LOAD   AR0
*       LOAD   AR1
*       LOAD   RC
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+------------------------------------------
*  R2       | F(P,N) = E(N) = EXCITATION
*  AR0      | ADDRESS OF FILTER COEFFICIENTS (K(P))
*  AR1      | ADDRESS OF BACKWARD PROPAGATION
*           | VALUES (B(P-1,N-1))
*  RC       | RC = P - 2
*
*  REGISTERS USED AS INPUT: R2, AR0, AR1, RC
*  REGISTERS MODIFIED: R0, R1, R2, R3, RS, RE, RC, AR0, AR1
*  REGISTER CONTAINING RESULT: R2 (f(0,n))
*
*  BENCHMARKS: CYCLES: 1 + 5P (not including subroutine overhead)
*  PROGRAM SIZE: 11 words (not including subroutine overhead)
*
        .global LATTICE
*
LATTICE RPTBD  LOOP                         ;Setup the delayed repeat block loop
        MPYF3  *AR0,*AR1,R0                 ;K(P) * B(P-1,N-1) -> R0
        SUBF3  R0,R2,R2                     ;Assume F(P,N) -> R2
        NOP                                 ;F(P,N)-K(P)*B(P-1,N-1)
*                                           ;= F(P-1,N) -> R2
*
*  2 <= I <= P (Repeat block loop starts here)
*
        MPYF3  *AR0,R2,R1                   ;K(I) * F(I-1,N) -> R1
        MPYF3  *--AR0(1),*-AR1(1),R0        ;K(I-1) *
*                                           ;B(I-1-1,N-1) -> R0
        ADDF3  *AR1--(1),R1,R3              ;B(I-1,N-1) + K(I)*F(I-1,N)
*                                           ;= B(I,N) -> R3
        STF    R3,*+AR1(2)                  ;B(I,N) -> B(I,N-1)
LOOP    SUBF3  R0,R2,R2                     ;F(I-1,N)-K(I-1)
*                                           ;*B(I-1-1,N-1)
*                                           ;= F(I-1-1,N) -> R2
*
*  I = 1 (CLEANUP)
*
        BUD    R11                          ;Delayed return
        MPYF   *AR0,R2,R1                   ;K(1) * F(0,N) -> R1
        ADDF3  *AR1,R1,R3                   ;B(0,N-1) + K(1)*F(0,N)
*                                           ;= B(1,N) -> R3
        STF    R3,*+AR1(1)                  ;B(1,N) -> B(1,N-1)
||      STF    R2,*AR1                      ;F(0,N) -> B(0,N-1)
*
*  end
*
        .end
6.4 Matrix-Vector Multiplication


In matrix-vector multiplication, a K x N matrix of elements m(i,j), having K
rows and N columns, is multiplied by an N x 1 vector to produce a K x 1 result.
The multiplier vector has elements v(j), and the product vector has elements
p(i). Each product-vector element is computed by the following expression:


    p(i) = m(i,0) v(0) + m(i,1) v(1) + ... + m(i,N-1) v(N-1)    i = 0,1,...,K-1


Each element is a dot product of one matrix row with the vector, so the
matrix-vector multiplication contains, as a special case, the dot product
presented in Example 2-1 on page 2-3 and Example 2-2 on page 2-5. In pseudo-C
format, the computation of the matrix-vector multiplication is expressed by


    for (i = 0; i < K; i++) {
        p(i) = 0;
        for (j = 0; j < N; j++)
            p(i) = p(i) + m(i,j) * v(j);
    }


Figure 6-7 shows the data memory organization for matrix-vector
multiplication, and Example 6-11 shows the ’C4x assembly code that implements
it. Note that in Example 6-11, K (number of rows) should be greater than 0, and
N (number of columns) should be greater than 1.
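
As a cross-check for the assembly routine, the same computation can be written
as a short Python sketch. This is an illustration under assumptions, not the
’C4x code: the function name mat_vec and the use of one flat list for the
matrix are choices made here to mimic a matrix stored row by row in
consecutive memory locations.

```python
def mat_vec(m, v, k, n):
    """Multiply a K x N matrix by an N x 1 vector.

    m is a flat list of K*N elements stored row by row
    (m(0,0), m(0,1), ..., m(0,N-1), m(1,0), ...).
    v is a list of N elements. Returns the list p of K elements.
    """
    if len(m) != k * n or len(v) != n:
        raise ValueError("matrix or vector size does not match K and N")
    p = [0.0] * k
    row_start = 0                      # index of m(i,0), like the matrix pointer
    for i in range(k):                 # loop over the rows
        acc = 0.0
        for j in range(n):             # dot product over the columns
            acc += m[row_start + j] * v[j]
        p[i] = acc
        row_start += n                 # next row; the vector index restarts at 0
    return p


# Usage: a 2 x 3 matrix times a 3 x 1 vector.
result = mat_vec([1.0, 2.0, 3.0,
                  4.0, 5.0, 6.0], [1.0, 0.0, -1.0], 2, 3)
```

The matrix index keeps advancing from row to row while the vector index
returns to v(0) after each row, which is the pointer behavior the assembly
version needs as well.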


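Figure 6-7. Data Memory Organization for Matrix-Vector Multiplication

[Figure: three memory columns with addresses increasing downward. The input
matrix is stored row by row (m(0,0), m(0,1), ..., m(1,0), m(1,1), ...); the
input vector holds v(0), v(1), ..., v(N-1); the result vector holds p(0),
p(1), ..., p(K-1).]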
Example 6-11. Matrix Times a Vector Multiplication

*
*  TITLE MATRIX TIMES A VECTOR MULTIPLICATION
*
*  SUBROUTINE MAT
*
*  MAT == MATRIX TIMES A VECTOR OPERATION
*
*  TYPICAL CALLING SEQUENCE:
*
*       load    AR0
*       load    AR1
*       load    AR2
*       load    AR3
*       load    RC
*       CALL    MAT
*
*  ARGUMENT ASSIGNMENTS:
*
*  ARGUMENT | FUNCTION
*  ---------+------------------------------------------
*  AR0      | ADDRESS OF M(0,0)
*  AR1      | ADDRESS OF V(0)
*  AR2      | ADDRESS OF P(0)
*  AR3      | NUMBER OF ROWS - 1 (K-1)
*  RC       | NUMBER OF COLUMNS - 2 (N-2)
*
*  REGISTERS USED AS INPUT: AR0, AR1, AR2, AR3, RC
*  REGISTERS MODIFIED: R0, R2, AR0, AR1, AR2, AR3, IR0, RC
*
*  BENCHMARKS: CYCLES: 1 + 7K + KN = 1 + K (N + 7)
*                      (not including subroutine overhead)
*              PROGRAM SIZE: 10 words (not including subroutine overhead)
*
        .global MAT
*
*  SETUP
*
MAT     ADDI3   RC,2,IR0            ;IR0 = N
*
*  FOR (i = 0; i < K; i++) LOOP OVER THE ROWS.
*
ROWS    RPTBD   DOT                 ;Setup multiply a row by a column
                                    ;Set loop counter
        LDF     0.0,R2              ;Initialize R2
        MPYF3   *AR0++(1),*AR1++(1),R0 ;m(i,0) * v(0) -> R0
        NOP
*
*  FOR (j = 1; j < N; j++) DO DOT PRODUCT OVER COLUMNS
*
DOT     MPYF3   *AR0++(1),*AR1++(1),R0 ;m(i,j) * v(j) -> R0
||      ADDF3   R0,R2,R2            ;m(i,j-1) * v(j-1) +
                                    ;R2 -> R2
*
        DBD     AR3,ROWS            ;counts the number of rows left
        ADDF    R0,R2               ;last accumulate
        STF     R2,*AR2++(1)        ;result -> p(i)
        NOP     *--AR1(IR0)         ;set AR1 to point to v(0)
*
*  !!! DELAYED BRANCH HAPPENS HERE !!!
*
*  RETURN SEQUENCE
*
        RETS                        ;return
*
*  end
*
        .end
6.5 Fast Fourier Transforms (FFTs)

Fourier transforms are an important tool in digital signal processing systems.
The transform converts information from the time domain to the frequency
domain; the inverse Fourier transform converts it back from the frequency
domain to the time domain. Computationally efficient implementations of the
Fourier transform are known as fast Fourier transforms (FFTs). The theory of
FFTs can be found in books such as DFT/FFT and Convolution Algorithms by
C.S. Burrus and T.W. Parks (John Wiley, 1985) and Digital Signal Processing
Applications With the TMS320 Family.


The ’C4x features that support efficient implementation of numerically
intensive algorithms are particularly well suited to FFTs. The high speed of
the ’C4x (40-ns cycle time) makes real-time algorithms easier to implement,
while the floating-point capability eliminates the problems associated with
dynamic range. The powerful indexing scheme in indirect addressing facilitates
access to FFT butterfly legs that have different spans. The repeat block
implemented by the RPTB or RPTBD instruction reduces the looping overhead in
algorithms that depend heavily on loops (such as FFTs); this gives the
efficiency of in-line coding with the form of a loop. Because the output of the
FFT is in scrambled (bit-reversed) order when the input is in regular order, it
must be restored to the proper order. This rearrangement does not require
extra cycles: the device has a special form of indirect addressing
(bit-reversed addressing mode) that can be used when the FFT output is
accessed.


The ’C4x can implement the bit-reversed addressing mode on either the CPU
or the DMA. This mode makes it possible to access the FFT output in the proper
order. If a DMA transfer with bit-reversed addressing mode is used, there is
no overhead for data input and output.
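
The index mapping behind bit-reversed order can be illustrated in a few lines
of Python. This sketch only shows which element ends up where; it does not
model the ’C4x addressing hardware, and the function names are assumptions
made for this example.

```python
def bit_reverse(index, bits):
    """Return index with its lowest `bits` bits in reversed order."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(fft_size):
    """List the bit-reversed index for every position of a power-of-2 FFT."""
    bits = fft_size.bit_length() - 1
    if fft_size < 1 or (1 << bits) != fft_size:
        raise ValueError("fft_size must be a power of 2")
    return [bit_reverse(i, bits) for i in range(fft_size)]


# Usage: for an 8-point FFT, position 1 of the scrambled output holds X(4).
order = bit_reversed_order(8)
```

Bit reversal is its own inverse, so applying the same mapping to scrambled
data restores natural order.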


There are several types of FFT examples in this section:

- Radix-2 and radix-4 algorithms, depending on the size of the FFT butterfly

- Decimation in time or frequency (DIT or DIF)

- Complex or real FFTs

- FFTs of different lengths, etc.

The following C-callable FFT code examples are provided in this section:

- Complex radix-2 DIF FFT: subsection 6.5.1

- Complex radix-4 DIF FFT: subsection 6.5.2

- Faster complex radix-2 DIT FFT: subsection 6.5.3

- Real radix-2 DIF FFT: subsection 6.5.4



Code for these FFTs can be found on the DSP Bulletin Board Service (under the
filename C40FFT.EXE). This file includes code, input data and sine table
examples, and batch files for compiling and linking. For instructions on how
to access the BBS, see subsection 10.1.3, The Bulletin Board Service (BBS).
To use this FFT code, you need to perform two steps:


- Provide a sine table in the format required by the program. The sine table
  is specific to the FFT size, except for the sine table required by the
  complex radix-2 DIT and the real radix-2 DIF FFT programs (as noted in
  Example 6-18).


- Align the input data buffer on a 2^(n+1)-word memory boundary; that is, the
  n+1 LSBs of the input buffer base address must be zero (n = log2 FFT_SIZE).
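
The alignment requirement can be expressed as a simple mask test. The Python
sketch below is a hypothetical helper for checking a candidate buffer address
on a host machine; the function name and the example addresses are assumptions
and do not come from the FFT package.

```python
def is_fft_buffer_aligned(base_address, fft_size):
    """Return True if the n+1 LSBs of base_address are zero, n = log2(fft_size)."""
    n = fft_size.bit_length() - 1
    if fft_size < 1 or (1 << n) != fft_size:
        raise ValueError("fft_size must be a power of 2")
    mask = (1 << (n + 1)) - 1          # ones in the n+1 least significant bits
    return (base_address & mask) == 0


# Usage: a 1024-point complex FFT needs an address that is a multiple of 2048.
ok = is_fft_buffer_aligned(0x1000, 1024)
bad = is_fft_buffer_aligned(0x0400, 1024)
```

The extra bit beyond log2(FFT_SIZE) accounts for the complex input occupying
2*FFT_SIZE words (real and imaginary parts interleaved).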


For most applications, the ’C4x quickly executes FFT lengths of up to 1024 
points (complex) or 2048 points (real) because it can do so almost entirely in 
on-chip memory. 


For FFTs larger than 1024 points (complex), see the application report,
Parallel 1-D FFT Implementation with the TMS320C4x DSPs, in the book Parallel
Processing Applications with the TMS320C4x DSP (literature number SPRA031).
This application report covers uniprocessor partitioned FFT implementation
for large FFTs. The source code is also available on the TI DSP Bulletin Board
(under the filename C40PFFT.EXE).
6.5.1 Complex Radix-2 DIF FFT


Example 6-12 shows a simple implementation of a complex radix-2 DIF FFT on
the ’C4x. The code is generic and can be used with any FFT length that is a
power of 2. However, for the complete implementation of an FFT, a table of
twiddle factors (sines/cosines) is needed, and this table depends on the size
of the transform. To retain the generic form of Example 6-12, the table with
the twiddle factors (containing 1-1/4 complete cycles of a sine) is presented
separately in Example 6-13 for the case of a 64-point FFT. A full cycle of the
sine should have a number of points equal to the FFT size. If the table with
the twiddle factors and the FFT code are kept in separate files, they should
be connected at link time.
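
A table of this form can be generated offline for any FFT size. The Python
sketch below is one possible generator, not the tool used to produce
Example 6-13; the function name is an assumption. It produces 5*N/4 samples of
one sine cycle sampled at N points, so that the cosine values start N/4
entries into the table.

```python
import math


def make_sine_table(fft_size):
    """Return 5*fft_size/4 samples of sin(2*pi*i/fft_size), i = 0, 1, ..."""
    if fft_size < 4 or fft_size % 4 != 0:
        raise ValueError("fft_size must be a multiple of 4")
    return [math.sin(2.0 * math.pi * i / fft_size)
            for i in range(5 * fft_size // 4)]


# Usage: print the table for a 64-point FFT as assembler .float directives.
table = make_sine_table(64)
lines = ["        .float  %9.6f" % value for value in table]
```

Because cos(x) = sin(x + pi/2), entry i + N/4 of the table is the cosine that
belongs to the sine in entry i, which is why a quarter cycle is appended to the
full cycle.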
Example 6-12. Complex Radix-2 DIF FFT

******************************************************************************
*
*  FILENAME    : CR2DIF.ASM
*  DESCRIPTION : COMPLEX, RADIX-2 DIF FFT FOR TMS320C40 (C callable)
*  DATE        : 6/29/93
*  VERSION     : 4.0
*
******************************************************************************
*
*  VERSION   DATE      COMMENTS
*  -------   ----      --------
*  1.0       10/87     PANNOS PAPAMICHALIS (TI Houston) Original Release
*  2.0       1/91      DANIEL CHEN (TI Houston): C40 porting
*  3.0       7/1/92    ROSEMARIE PIEDRA (TI Houston): made it C-callable
*  4.0       6/29/93   ROSEMARIE PIEDRA (TI Houston): added support for
*                      in-place bit reversing
*
******************************************************************************
*
*  SYNOPSIS: int cr2dif(SOURCE_ADDR,FFT_SIZE,LOGFFT,DST_ADDR)
*                          ar2         r2       r3     rc
*
*       float *SOURCE_ADDR      ;input address
*       int   FFT_SIZE          ;64, 128, 256, 512, 1024, ...
*       int   LOGFFT            ;log (base 2) of FFT_SIZE
*       float *DST_ADDR         ;destination address
*
*  - The computation is done in-place.
*  - Sections to be allocated in linker command file: .ffttxt : FFT code
*                                                     .fftdat : FFT data
*  - If SOURCE_ADDR=DST_ADDR, then in-place bit reversing is performed
*
******************************************************************************
*
*  DESCRIPTION:
*
*  Generic program for a radix-2 DIF FFT computation using the TMS320C4x family.
*  The computation is done in-place and the result is bit-reversed. The program
*  is from the Burrus and Parks book, p. 111. The input data array is
*  2*FFT_SIZE-long with real and imaginary data in consecutive memory
*  locations: Re-Im-Re-Im
*
*  The twiddle factors are supplied in a table put in a section with a global
*  label _SINE pointing to the beginning of the table. This data is included in
*  a separate file to preserve the generic nature of the program. The sine
*  table size is (5*FFT_SIZE)/4.
*
*  Note: Sections needed in the linker command file: .ffttxt : FFT code
*                                                    .fftdat : FFT data
*
******************************************************************************
*
*       AR + j AI ---------------------------------- AR' + j AI'
*                   \   /   +
*                    \ /
*                     /
*                    / \
*                   /   \   +
*       BR + j BI ----------- COS - j SIN ---------- BR' + j BI'
*                         -
*
*       AR' = AR + BR
*       AI' = AI + BI
*       BR' = (AR-BR)*COS + (AI-BI)*SIN
*       BI' = (AI-BI)*COS - (AR-BR)*SIN
*
******************************************************************************
*
        .globl  _SINE               ;Address of sine/cosine table
        .globl  _cr2dif             ;Entry point for execution
        .globl  STARTB,ENDB         ;starting/ending point for benchmarks
        .sect   ".fftdat"
SINTAB  .word   _SINE
OUTPUTP .space  1
FFTSIZE .space  1
        .sect   ".ffttxt"
_cr2dif:
        LDI     SP,AR0
        PUSH    DP
        PUSH    R4                  ;Save dedicated registers
        PUSH    R5
        PUSH    R6                  ;lower 32 bits
        PUSHF   R6                  ;upper 32 bits
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    R8
        LDP     SINTAB
        .if     .REGPARM == 0       ;stack is used for parameter passing
        LDI     *-AR0(1),AR2        ;points to input data
        LDI     *-AR0(2),R10        ;R10=N
        LDI     *-AR0(3),R9         ;R9 holds the remaining stage number
        LDI     *-AR0(4),RC         ;points where FFT result should move to
        .else                       ;registers are used for parameter passing
        LDI     R2,R10
        LDI     R3,R9
        .endif
        STI     RC,@OUTPUTP
        STI     R10,@FFTSIZE
STARTB:
        LDI     1,R8                ;Initialize repeat counter of first loop
        LSH3    1,R10,IR0           ;IR0=2*N1 (because of real/imag)
        LSH3    -2,R10,IR1          ;IR1=N/4, pointer for SIN/COS table
        LDI     1,AR5               ;Initialize IE index (AR5=IE)
        LSH     1,R10
        SUBI3   1,R8,RC             ;RC should be one less than desired #
*  Outer loop
LOOP:
        RPTBD   BLK1                ;Setup for first loop
        LSH     -1,R10              ;N2=N2/2
        LDI     AR2,AR0             ;AR0 points to X(I)
        ADDI    R10,AR0,AR6         ;AR6 points to X(L)
*
*  First loop
*
        ADDF    *AR0,*AR6,R0        ;R0=X(I)+X(L)
        SUBF    *AR6++,*AR0++,R1    ;R1=X(I)-X(L)
        ADDF    *AR6,*AR0,R2        ;R2=Y(I)+Y(L)
        SUBF    *AR6,*AR0,R3        ;R3=Y(I)-Y(L)
        STF     R2,*AR0--           ;Y(I)=R2 and...
||      STF     R3,*AR6--           ;Y(L)=R3
BLK1    STF     R0,*AR0++(IR0)      ;X(I)=R0 and...
||      STF     R1,*AR6++(IR0)      ;X(L)=R1; AR0 and AR6 advance by IR0
*  If this is the last stage, you are done
        SUBI    1,R9
        BZD     ENDB
*  main inner loop
        LDI     2,AR1               ;Init loop counter for inner loop
        LDI     @SINTAB,AR4         ;Initialize IA index (AR4=IA)
        ADDI    AR5,AR4             ;IA=IA+IE; AR4 points to cosine
        ADDI    AR2,AR1,AR0         ;(X(I),Y(I)) pointer
        SUBI    1,R8,RC             ;RC should be one less than desired #
INLOP:
        RPTBD   BLK2                ;Setup for second loop
        ADDI    R10,AR0,AR6         ;(X(L),Y(L)) pointer
        ADDI    2,AR1
        LDF     *AR4,R6             ;R6=SIN
*
*  Second loop
*
        SUBF    *AR6,*AR0,R2        ;R2=X(I)-X(L)
        SUBF    *+AR6,*+AR0,R1      ;R1=Y(I)-Y(L)
        MPYF    R2,R6,R0            ;R0=R2*SIN and...
||      ADDF    *+AR6,*+AR0,R3      ;R3=Y(I)+Y(L)
        MPYF    R1,*+AR4(IR1),R3    ;R3=R1*COS and...
||      STF     R3,*+AR0            ;Y(I)=Y(I)+Y(L)
        SUBF    R0,R3,R4            ;R4=R1*COS-R2*SIN
        MPYF    R1,R6,R0            ;R0=R1*SIN and...
||      ADDF    *AR6,*AR0,R3        ;R3=X(I)+X(L)
        MPYF    R2,*+AR4(IR1),R3    ;R3=R2*COS and...
||      STF     R3,*AR0++(IR0)      ;X(I)=X(I)+X(L) and AR0=AR0+2*N1
        ADDF    R0,R3,R5            ;R5=R2*COS+R1*SIN
BLK2    STF     R5,*AR6++(IR0)      ;X(L)=R2*COS+R1*SIN, incr AR6 and...
||      STF     R4,*+AR6            ;Y(L)=R1*COS-R2*SIN
        CMPI    R10,AR1
        BNEAF   INLOP               ;Loop back to the inner loop
        ADDI    AR5,AR4             ;IA=IA+IE; AR4 points to cosine
        ADDI    AR2,AR1,AR0         ;(X(I),Y(I)) pointer
        SUBI    1,R8,RC
        LSH     1,R8                ;Increment loop counter for next time
        BRD     LOOP                ;Next FFT stage (delayed)
        LSH     1,AR5               ;IE=2*IE
        LDI     R10,IR0             ;N1=N2
        SUBI3   1,R8,RC
ENDB:
******************************************************************************
*------------------------------ BITREVERSAL ---------------------------------*
* This bit-reversal section assumes input and output in Re-Im-Re-Im format   *
******************************************************************************
        cmpi    @OUTPUTP,ar2
        beqd    INPLACE
        nop
        ldi     @FFTSIZE,ir0        ;ir0 = FFT_SIZE
        subi    2,ir0,rc            ;rc = FFT_SIZE-2
                                    ;SRC different from DST
        rptbd   BITRV
        ldi     2,ir1               ;ir1 = 2
        ldi     @OUTPUTP,ar1        ;ar1 = DST_ADDR
        ldf     *+ar2(1),r0         ;read first Im value
        ldf     *ar2++(ir0)b,r1
||      stf     r0,*+ar1(1)
BITRV   ldf     *+ar2(1),r0
||      stf     r1,*ar1++(ir1)
        bud     END
        ldf     *ar2++(ir0)b,r1
||      stf     r0,*+ar1(1)
        nop
        stf     r1,*ar1
INPLACE rptbd   BITRV2              ;in place bit reversing
        ldi     ar2,ar1             ;ar2 = SRC_ADDR
        nop     *++ar1(2)
        nop     *ar2++(ir0)b
        cmpi    ar1,ar2
        bgeat   CONT
        ldf     *ar1,r0
||      ldf     *ar2,r1
        stf     r0,*ar2
||      stf     r1,*ar1
        ldf     *+ar1(1),r0
||      ldf     *+ar2(1),r1
        stf     r0,*+ar2(1)
||      stf     r1,*+ar1(1)
CONT    nop     *++ar1(2)
BITRV2  nop     *ar2++(ir0)b
;
;       Return to C environment.
;
END:    POP     R8
        POP     AR6                 ;Restore the register values and return
        POP     AR5
        POP     AR4
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     DP
        RETS
        .end
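
For readers who want to follow the data flow of Example 6-12 without reading
the assembly, the same algorithm can be sketched in Python. This is a
simplified model under assumptions, not a translation of the listing: the
names cr2dif_model and dft are invented for this example, complex numbers
replace the interleaved Re-Im array, and the loop structure follows the usual
three-loop radix-2 DIF form with the butterfly equations and the 5*N/4 sine
table described above.

```python
import cmath
import math


def cr2dif_model(x):
    """Radix-2 DIF FFT of a list of complex values (length a power of 2, >= 4).

    Returns the transform in natural order (the in-place result is
    bit-reversed, so it is unscrambled at the end).
    """
    n = len(x)
    bits = n.bit_length() - 1
    if n < 4 or (1 << bits) != n:
        raise ValueError("length must be a power of 2 and at least 4")
    sine = [math.sin(2.0 * math.pi * i / n) for i in range(5 * n // 4)]
    a = list(x)
    n2 = n
    ie = 1                                   # twiddle index step (IE)
    while n2 > 1:                            # one pass per FFT stage
        n1 = n2
        n2 //= 2
        ia = 0                               # twiddle index (IA)
        for j in range(n2):
            s = sine[ia]                     # SIN
            c = sine[ia + n // 4]            # COS, a quarter cycle further on
            ia += ie
            for i in range(j, n, n1):
                l = i + n2
                t = a[i] - a[l]              # (AR-BR) + j(AI-BI)
                a[i] = a[i] + a[l]           # AR' + jAI'
                a[l] = complex(t.real * c + t.imag * s,   # BR'
                               t.imag * c - t.real * s)   # BI'
        ie *= 2

    def bit_reverse(index):
        r = 0
        for _ in range(bits):
            r = (r << 1) | (index & 1)
            index >>= 1
        return r

    return [a[bit_reverse(k)] for k in range(n)]


def dft(x):
    """Direct DFT, used here only as a reference for checking."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


# Usage: compare the FFT model against the direct DFT for 8 points.
data = [complex(i, 1.0 - 0.5 * i * i) for i in range(8)]
max_error = max(abs(p - q) for p, q in zip(cr2dif_model(data), dft(data)))
```

The first pass uses twiddle index 0 for j = 0, where COS = 1 and SIN = 0; the
assembly listing exploits this by handling that case in a separate loop (BLK1)
with no multiplies.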



Example 6-13. Table With Twiddle Factors for a 64-Point FFT 


******************************************************************************
*
*  TITLE TABLE WITH TWIDDLE FACTORS FOR A 64-POINT FFT
*
*  FILE TO BE LINKED WITH THE SOURCE CODE FOR A 64-POINT,
*  RADIX-2 DIF COMPLEX FFT OR A RADIX-4 DIF COMPLEX FFT.
*
*  SINE TABLE LENGTH = 5*FFTSIZE/4
*
******************************************************************************
        .globl  _SINE
        .sect   ".sintab"
_SINE
        .float  0.000000
        .float  0.098017
        .float  0.195090
        .float  0.290285
        .float  0.382683
        .float  0.471397
        .float  0.555570
        .float  0.634393
        .float  0.707107
        .float  0.773010
        .float  0.831470
        .float  0.881921
        .float  0.923880
        .float  0.956940
        .float  0.980785
        .float  0.995185
_COSINE
        .float  1.000000
        .float  0.995185
        .float  0.980785
        .float  0.956940
        .float  0.923880
        .float  0.881921
        .float  0.831470
        .float  0.773010
        .float  0.707107
        .float  0.634393
        .float  0.555570
        .float  0.471397
        .float  0.382683
        .float  0.290285
        .float  0.195090
        .float  0.098017
        .float  0.000000
        .float  -0.098017
        .float  -0.195090
        .float  -0.290285
        .float  -0.382683
        .float  -0.471397
        .float  -0.555570
        .float  -0.634393
        .float  -0.707107
        .float  -0.773010
        .float  -0.831470
        .float  -0.881921
        .float  -0.923880
        .float  -0.956940
        .float  -0.980785
        .float  -0.995185
        .float  -1.000000
        .float  -0.995185
        .float  -0.980785
        .float  -0.956940
        .float  -0.923880
        .float  -0.881921
        .float  -0.831470
        .float  -0.773010
        .float  -0.707107
        .float  -0.634393
        .float  -0.555570
        .float  -0.471397
        .float  -0.382683
        .float  -0.290285
        .float  -0.195090
        .float  -0.098017
        .float  0.000000
        .float  0.098017
        .float  0.195090
        .float  0.290285
        .float  0.382683
        .float  0.471397
        .float  0.555570
        .float  0.634393
        .float  0.707107
        .float  0.773010
        .float  0.831470
        .float  0.881921
        .float  0.923880
        .float  0.956940
        .float  0.980785
        .float  0.995185


6.5.2 Complex Radix-4 DIF FFT 


The radix-2 algorithm has tutorial value because it makes it relatively easy
to understand how the FFT algorithm functions. However, radix-4
implementations can increase the speed of execution by reducing the overall
arithmetic required. Example 6-14 shows the generic implementation of a
complex DIF FFT in radix-4. A companion table like the one in Example 6-13
should be used to provide the twiddle factors.
Example 6-14. Complex Radix-4 DIF FFT

******************************************************************************
*
*  FILENAME    : CR4DIF.ASM
*  DESCRIPTION : COMPLEX, RADIX-4 DIF FFT FOR TMS320C40 (C callable)
*  DATE        : 6/29/93
*  VERSION     : 4.0
*
******************************************************************************
*
*  VERSION   DATE      COMMENTS
*  -------   ----      --------
*  1.0       10/87     PANNOS PAPAMICHALIS (TI Houston)
*                      Original Release
*  2.0       1/91      DANIEL CHEN (TI Houston): C40 porting
*  3.0       7/1/91    ROSEMARIE PIEDRA (TI Houston): made it C-callable
*  4.0       6/29/93   ROSEMARIE PIEDRA (TI Houston): added support for
*                      in-place bit reversing.
*
******************************************************************************
*
*  SYNOPSIS: int cr4dif(SOURCE_ADDR,FFT_SIZE,LOGFFT,DST_ADDR)
*                          ar2         r2       r3     rc
*
*       float *SOURCE_ADDR      ;input address
*       int   FFT_SIZE          ;64, 256, 1024, ...
*       int   LOGFFT            ;log (base 4) of FFT_SIZE
*       float *DST_ADDR         ;destination address
*
*  - The computation is done in-place.
*  - Sections to be allocated in linker command file: .ffttxt : FFT code
*                                                     .fftdat : FFT data
*  - If SOURCE_ADDR=DST_ADDR, then in-place bit reversing is performed
*
******************************************************************************
*
*  DESCRIPTION:
*
*  Generic program for a radix-4 DIF FFT computation using the TMS320C4x
*  family. The computation is done in-place and the result is bit-reversed.
*  The program is taken from the Burrus and Parks book, p. 117.
*  The input data array is 2*FFT_SIZE-long with real and imaginary data
*  in consecutive memory locations: Re-Im-Re-Im
*
*  The twiddle factors are supplied in a table put in a section
*  with a global label _SINE pointing to the beginning of the table.
*  This data is included in a separate file to preserve the generic
*  nature of the program. The sine table size is (5*FFT_SIZE)/4.
*
*  In order to have the final results in bit-reversed order, the two
*  middle branches of the radix-4 butterfly are interchanged during
*  storage. Note the difference when comparing with the program on p. 117
*  of the Burrus and Parks book.
*
*  Note: Sections needed in the linker command file: .ffttxt : FFT code
*                                                    .fftdat : FFT data
*
******************************************************************************
*
*  WARNING:
*
*  For optimization purposes, LDF *+AR1,R0 (see **1**) will fetch memory outside
*  the input buffer range during the "first loop" execution (RC=0). Even though
*  the read value (R0) is not used in the code, this could cause a halt
*  situation if AR1 points to external memory that is not ready.
*
******************************************************************************
        .globl  _SINE               ;Address of sine/cosine table
        .globl  _cr4dif             ;Entry point for execution
        .globl  STARTB,ENDB         ;starting/ending point for benchmarks
        .sect   ".fftdat"
FFTSIZ  .space  1
SINTAB  .word   _SINE
SINTAB1 .word   _SINE-1
INPUTP  .space  1
OUTPUTP .space  1
        .sect   ".ffttxt"
_cr4dif:
        LDI     SP,AR0
        PUSH    DP
        PUSH    R4                  ;Save dedicated registers
        PUSH    R5
        PUSH    R6                  ;lower 32 bits
        PUSHF   R6                  ;upper 32 bits
        PUSH    R7                  ;lower 32 bits
        PUSHF   R7                  ;upper 32 bits
        PUSH    AR3
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    AR7
        PUSH    R8
        .if     .REGPARM == 0
        LDI     *-AR0(1),AR2        ;points to input data
        LDI     *-AR0(2),R10        ;R10=N
        LDI     *-AR0(3),R9         ;R9 holds the remaining stage number
        LDI     *-AR0(4),RC         ;points to where FFT result should move to
        .else
        LDI     R2,R10
        LDI     R3,R9
        .endif
        LDP     FFTSIZ              ;Command to load data page pointer
        STI     AR2,@INPUTP
        STI     RC,@OUTPUTP
        STI     R10,@FFTSIZ




STARTB:
        LDI     @FFTSIZ,BK
        LSH3    1,BK,IR0            ;IR0=2*N1 (because of real/imag)
        LSH3    -2,BK,IR1           ;IR1=N/4, pointer for SIN/COS table
        LDI     1,AR7               ;Initialize IE index
        LDI     1,R8                ;Initialize repeat counter of first loop
        ADDI    2,IR1,R9            ;R9=JT
        LSH     -1,BK               ;BK=N2
*  OUTER LOOP
LOOP:   LDI     @INPUTP,AR0         ;AR0 points to X(I)
        SUBI3   1,R8,RC             ;RC should be one less than desired #
        ADDI    BK,AR0,AR1          ;AR1 points to X(I1)
        RPTBD   BLK1                ;Setup loop BLK1
        ADDI    BK,AR1,AR2          ;AR2 points to X(I2)
        ADDI    BK,AR2,AR3          ;AR3 points to X(I3)
        LDF     *+AR1,R0            ;R0=Y(I1)
*  FIRST LOOP: BLK1
        ADDF    R0,*+AR3,R3         ;R3=Y(I1)+Y(I3)
        ADDF    *+AR0,*+AR2,R1      ;R1=Y(I)+Y(I2)
        ADDF    R3,R1,R6            ;R6=R1+R3
        SUBF    *+AR2,*+AR0,R4      ;R4=Y(I)-Y(I2)
        LDF     *AR2,R5             ;R5=X(I2)
        STF     R6,*+AR0            ;Y(I)=R1+R3
        SUBF    R3,R1               ;R1=R1-R3
        ADDF    *AR3,*AR1,R3        ;R3=X(I1)+X(I3)
        ADDF    R5,*AR0,R1          ;R1=X(I)+X(I2)
        STF     R1,*+AR1            ;Y(I1)=R1-R3
        ADDF    R3,R1,R6            ;R6=R1+R3
        SUBF    R5,*AR0,R2          ;R2=X(I)-X(I2)
        STF     R6,*AR0++(IR0)      ;X(I)=R1+R3
        SUBF    R3,R1               ;R1=R1-R3
        SUBF    *AR3,*AR1,R6        ;R6=X(I1)-X(I3)
        SUBF    R0,*+AR3,R3         ;-R3=Y(I1)-Y(I3)
        STF     R1,*AR1++(IR0)      ;X(I1)=R1-R3
        SUBF    R6,R4,R5            ;R5=R4-R6
        ADDF    R6,R4               ;R4=R4+R6
        STF     R5,*+AR2            ;Y(I2)=R4-R6
        STF     R4,*+AR3            ;Y(I3)=R4+R6
        SUBF    R3,R2,R5            ;R5=R2+R3
        ADDF    R3,R2               ;R2=R2-R3
        STF     R2,*AR3++(IR0)      ;X(I3)=R2+R3
BLK1    STF     R5,*AR2++(IR0)      ;X(I2)=R2-R3
||      LDF     *+AR1,R0            ;R0=Y(I1) **1**
*  IF THIS IS THE LAST STAGE, YOU ARE DONE
        CMPI    IR1,R8
        BZD     ENDB
*  MAIN INNER LOOP
*
        LDI     1,R10               ;Init IA1 index
        LDI     2,R11               ;Init loop counter for inner loop
        LDI     R11,AR0
        ADDI    @INPUTP,AR0         ;(X(I),Y(I)) pointer
        ADDI    2,R11               ;Increment inner loop counter
INLOP:  ADDI    AR7,R10             ;IA1=IA1+IE
        ADDI    BK,AR0,AR1          ;(X(I1),Y(I1)) pointer
        CMPI    R9,R11              ;If LPCNT=JT, go to
        BZD     SPCL                ;special butterfly
        ADDI    BK,AR1,AR2          ;(X(I2),Y(I2)) pointer
        ADDI    BK,AR2,AR3          ;(X(I3),Y(I3)) pointer
        SUBI3   1,R8,RC             ;RC should be one less than desired #
        LDI     R10,AR4
        ADDI    @SINTAB1,AR4        ;Create cosine index AR4
        ADDI    AR4,R10,AR5
        SUBI    1,AR5               ;IA2=IA1+IA1-1
        RPTBD   BLK2                ;Setup loop BLK2
        ADDI    R10,AR5,AR6
        SUBI    1,AR6               ;IA3=IA2+IA1-1
        LDF     *+AR2,R7            ;R7=Y(I2)
*
*  SECOND LOOP: BLK2
*
        ADDF    R7,*+AR0,R3         ;R3=Y(I)+Y(I2)
        ADDF    *+AR3,*+AR1,R5      ;R5=Y(I1)+Y(I3)
        ADDF    R5,R3,R6            ;R6=R3+R5
        SUBF    R7,*+AR0,R4         ;R4=Y(I)-Y(I2)
        SUBF    R5,R3               ;R3=R3-R5
        ADDF    *AR2,*AR0,R1        ;R1=X(I)+X(I2)
        ADDF    *AR3,*AR1,R5        ;R5=X(I1)+X(I3)
        MPYF    R3,*+AR5(IR1),R6    ;R6=R3*CO2
||      STF     R6,*+AR0            ;Y(I)=R3+R5
        ADDF    R5,R1,R0            ;R0=R1+R5
        SUBF    *AR2,*AR0,R2        ;R2=X(I)-X(I2)
        SUBF    R5,R1               ;R1=R1-R5
        MPYF    R1,*AR5,R0          ;R0=R1*SI2
||      STF     R0,*AR0++(IR0)      ;X(I)=R1+R5
        SUBF    R0,R6               ;R6=R3*CO2-R1*SI2
        SUBF    *+AR3,*+AR1,R5      ;R5=Y(I1)-Y(I3)
        MPYF    R1,*+AR5(IR1),R0    ;R0=R1*CO2
||      STF     R6,*+AR1            ;Y(I1)=R3*CO2-R1*SI2
        MPYF    R3,*AR5,R6          ;R6=R3*SI2
        ADDF    R0,R6               ;R6=R1*CO2+R3*SI2
        ADDF    R5,R2,R1            ;R1=R2+R5
        SUBF    R5,R2               ;R2=R2-R5
        SUBF    *AR3,*AR1,R5        ;R5=X(I1)-X(I3)
        SUBF    R5,R4,R3            ;R3=R4-R5
        ADDF    R5,R4               ;R4=R4+R5
        MPYF    R3,*+AR4(IR1),R6    ;R6=R3*CO1
||      STF     R6,*AR1++(IR0)      ;X(I1)=R1*CO2+R3*SI2
        MPYF    R1,*AR4,R0          ;R0=R1*SI1
        SUBF    R0,R6               ;R6=R3*CO1-R1*SI1
        MPYF    R1,*+AR4(IR1),R6    ;R6=R1*CO1
||      STF     R6,*+AR2            ;Y(I2)=R3*CO1-R1*SI1
        MPYF    R3,*AR4,R0          ;R0=R3*SI1
        ADDF    R0,R6               ;R6=R1*CO1+R3*SI1
        MPYF    R4,*+AR6(IR1),R6    ;R6=R4*CO3
||      STF     R6,*AR2++(IR0)      ;X(I2)=R1*CO1+R3*SI1
        MPYF    R2,*AR6,R0          ;R0=R2*SI3
        SUBF    R0,R6               ;R6=R4*CO3-R2*SI3
        MPYF    R2,*+AR6(IR1),R6    ;R6=R2*CO3
||      STF     R6,*+AR3            ;Y(I3)=R4*CO3-R2*SI3
        MPYF    R4,*AR6,R0          ;R0=R4*SI3
        ADDF    R0,R6               ;R6=R2*CO3+R4*SI3
BLK2    STF     R6,*AR3++(IR0)      ;X(I3)=R2*CO3+R4*SI3
||      LDF     *+AR2,R7            ;Load next Y(I2)
        CMPI    R11,BK
        BPD     INLOP               ;LOOP BACK TO THE INNER LOOP
        LDI     R11,AR0
        ADDI    @INPUTP,AR0         ;(X(I),Y(I)) pointer
        ADDI    2,R11               ;Increment inner loop counter
        BRD     CONT
        LSH     2,R8                ;Increment repeat counter for next time
        LSH     2,AR7               ;IE=4*IE
        LDI     BK,IR0              ;N1=N2
*  SPECIAL BUTTERFLY FOR W=J
SPCL    RPTBD   BLK3                ;Setup loop BLK3
        LSH     -1,IR1,AR4          ;Point to SIN(45)
        ADDI    @SINTAB,AR4         ;Create cosine index AR4=CO21
        LDF     *AR2,R7             ;R7=X(I2)
*  SPCL LOOP: BLK3
        ADDF    R7,*AR0,R1          ;R1=X(I)+X(I2)
        ADDF    *+AR2,*+AR0,R3      ;R3=Y(I)+Y(I2)
        SUBF    *+AR2,*+AR0,R4      ;R4=Y(I)-Y(I2)
        ADDF    *AR3,*AR1,R5        ;R5=X(I1)+X(I3)
        SUBF    R1,R5,R6            ;R6=R5-R1
        ADDF    R5,R1               ;R1=R1+R5
        ADDF    *+AR3,*+AR1,R5      ;R5=Y(I1)+Y(I3)
        SUBF    R5,R3,R0            ;R0=R3-R5
        ADDF    R5,R3               ;R3=R3+R5
        SUBF    R7,*AR0,R2          ;R2=X(I)-X(I2)
||      STF     R3,*+AR0            ;Y(I)=R3+R5
        LDF     *AR3,R7             ;R7=X(I3)
||      STF     R1,*AR0++(IR0)      ;X(I)=R1+R5
        SUBF    *+AR3,*+AR1,R3      ;R3=Y(I1)-Y(I3)
        SUBF    R7,*AR1,R1          ;R1=X(I1)-X(I3)
||      STF     R6,*+AR1            ;Y(I1)=R5-R1
        ADDF    R3,R2,R5            ;R5=R2+R3
        SUBF    R2,R3,R2            ;R2=-R2+R3
        SUBF    R1,R4,R3            ;R3=R4-R1
        ADDF    R1,R4               ;R4=R4+R1
        SUBF    R5,R3,R1            ;R1=R3-R5
        MPYF    R1,*AR4,R1          ;R1=R1*CO21
||      STF     R0,*AR1++(IR0)      ;X(I1)=R3-R5
        ADDF    R5,R3               ;R3=R3+R5
        MPYF    R3,*AR4,R3          ;R3=R3*CO21
||      STF     R1,*+AR2            ;Y(I2)=(R3-R5)*CO21
        SUBF    R4,R2,R1            ;R1=R2-R4
        MPYF    R1,*AR4,R1          ;R1=R1*CO21
||      STF     R3,*AR2++(IR0)      ;X(I2)=(R3+R5)*CO21
        ADDF    R4,R2               ;R2=R2+R4
        MPYF3   R2,*AR4,R2          ;R2=R2*CO21
||      STF     R1,*+AR3            ;Y(I3)=-(R4-R2)*CO21
BLK3    LDF     *AR2,R7             ;Load next X(I2)
||      STF     R2,*AR3++(IR0)      ;X(I3)=(R4+R2)*CO21
        CMPI    R11,BK
        BPD     INLOP               ;Loop back to the inner loop
        LDI     R11,AR0
        ADDI    @INPUTP,AR0         ;(X(I),Y(I)) pointer
        ADDI    2,R11               ;Increment inner loop counter
        LSH     2,R8                ;Increment repeat counter for next time
        LSH     2,AR7               ;IE=4*IE
        LDI     BK,IR0              ;N1=N2
CONT    BRD     LOOP                ;Next FFT stage (delayed)
        LSH     -2,BK               ;N2=N2/4
        LSH3    -1,BK,R9
        ADDI    2,R9                ;JT=N2/2+2
ENDB:
******************************************************************************
*------------------------------ BIT REVERSAL --------------------------------*
* This bit-reversal section assumes input and output in Re-Im-Re-Im format   *
******************************************************************************
        LDI     @INPUTP,ar0
        CMPI    @OUTPUTP,ar0
        BEQD    INPLACE
        LDI     @OUTPUTP,ar1        ;ar1=DST_ADDR
        LDI     @FFTSIZ,ir0         ;ir0=FFT_SIZE
        SUBI    2,ir0,rc            ;rc=FFT_SIZE-2
        RPTBD   bitrv1
        LDI     2,ir1               ;ir1=2
        LDF     *+ar0(1),r0         ;read first Im value
        NOP
        LDF     *ar0++(ir0)b,r1
||      STF     r0,*+ar1(1)
bitrv1  LDF     *+ar0(1),r0
||      STF     r1,*ar1++(ir1)
        BUD     END
        LDF     *ar0++(ir0)b,r1
||      STF     r0,*+ar1(1)
        NOP
        STF     r1,*ar1
INPLACE RPTBD   BITRV2
        NOP     *++ar1(2)
        NOP     *ar0++(ir0)b
        NOP
        CMPI    ar1,ar0
        BGEAT   CONT2
        LDF     *ar1,r0
||      LDF     *ar0,r1
        STF     r0,*ar0
||      STF     r1,*ar1
        LDF     *+ar1(1),r0
||      LDF     *+ar0(1),r1
        STF     r0,*+ar0(1)
||      STF     r1,*+ar1(1)
CONT2   NOP     *++ar1(2)
BITRV2  NOP     *ar0++(ir0)b
END:    POP     R8                  ;Restore the register values and return
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POP     AR3
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     DP
        RETS
        .end
6.5.3 Faster Complex Radix-2 DIT FFT


Example 6-12 and Example 6-14 make it easy to understand how the FFT
algorithm functions. However, those examples are not optimized for fast
execution of the FFT. Example 6-15 shows a faster version of a radix-2 DIT FFT
algorithm. This program uses a different twiddle factor table than the
previous examples. The twiddle factors are stored in bit-reversed order and
with a table length of N/2 (N = FFT length), as shown in Example 6-16. For
instance, if the FFT length is 32, the twiddle factor table should be:


    Address   Coefficient

    0          R{WN(0)} = COS(2*PI*0/32) = 1
    1         -I{WN(0)} = SIN(2*PI*0/32) = 0
    2          R{WN(4)} = COS(2*PI*4/32) = 0.707
    3         -I{WN(4)} = SIN(2*PI*4/32) = 0.707
    ...
    12         R{WN(3)} = COS(2*PI*3/32) = 0.831
    13        -I{WN(3)} = SIN(2*PI*3/32) = 0.556
    14         R{WN(7)} = COS(2*PI*7/32) = 0.195
    15        -I{WN(7)} = SIN(2*PI*7/32) = 0.981
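
A table in this layout can be generated with a short script. The Python sketch
below is an illustration under assumptions, not the generator used for
Example 6-16: the function name is invented here, and the layout (cosine in
the even address, sine in the odd address, pair index bit-reversed over N/4
pairs) is inferred from the 32-point table above.

```python
import math


def make_dit_twiddle_table(fft_size):
    """Return N/2 values: (COS, SIN) pairs for WN(m), m in bit-reversed order."""
    pairs = fft_size // 4                  # N/4 twiddle factors, two words each
    bits = pairs.bit_length() - 1
    if pairs < 1 or (1 << bits) != pairs:
        raise ValueError("fft_size must be a power of 2 and at least 4")
    table = []
    for slot in range(pairs):
        m = 0                              # bit-reverse the slot number
        index = slot
        for _ in range(bits):
            m = (m << 1) | (index & 1)
            index >>= 1
        angle = 2.0 * math.pi * m / fft_size
        table.append(math.cos(angle))      # R{WN(m)}
        table.append(math.sin(angle))      # -I{WN(m)}
    return table


# Usage: reproduce the 32-point example, rounded to three decimals.
table32 = [round(value, 3) for value in make_dit_twiddle_table(32)]
```

For N = 32 this yields the twiddle exponents 0, 4, 2, 6, 1, 5, 3, 7 in
consecutive address pairs, which matches addresses 0 to 3 and 12 to 15 shown
above.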
Example 6-15. Faster Version Complex Radix-2 DIT FFT

******************************************************************************
*
*  FILENAME    : CR2DIT.ASM
*  DESCRIPTION : COMPLEX, RADIX-2 DIT FFT FOR TMS320C40
*  DATE        : 6/29/93
*  VERSION     : 4.0
*
******************************************************************************
*
*  VERSION   DATE      COMMENTS
*  -------   ----      --------
*  1.0       7/89      Original version
*                      RAIMUND MEYER, KARL SCHWARZ
*                      LEHRSTUHL FUER NACHRICHTENTECHNIK
*                      UNIVERSITAET ERLANGEN-NUERNBERG
*                      CAUERSTRASSE 7, D-8520 ERLANGEN, FRG
*  2.0       1/91      DANIEL CHEN (TI HOUSTON): C40 porting
*  3.0       7/1/92    ROSEMARIE PIEDRA (TI HOUSTON): made it
*                      C-callable and implemented changes in the order
*                      of the operands for some mpyf instructions for
*                      faster execution when sine table is off-chip
*  4.0       6/29/93   ROSEMARIE PIEDRA (TI Houston): Added support
*                      for in-place bit reversing.
*
******************************************************************************
*
*  SYNOPSIS: int cr2dit(SOURCE_ADDR,FFT_SIZE,DST_ADDR)
*                          ar2         r2       r3
*
*       float *SOURCE_ADDR      ;Points to where data is originated
*                               ;and operated on.
*       int   FFT_SIZE          ;64, 128, 256, 512, 1024, ...
*       float *DST_ADDR         ;Points to where FFT results should be
*                               ;moved
*
******************************************************************************
*
*  THE COMPUTATION IS DONE IN-PLACE.
*  FOR THIS PROGRAM THE MINIMUM FFT LENGTH IS 32 POINTS BECAUSE OF THE
*  SEPARATE STAGES (THIS IS NOT CHECKED INSIDE THE PROGRAM).
*  THE FIRST TWO PASSES ARE REALIZED AS A FOUR BUTTERFLY LOOP SINCE THE
*  MULTIPLIES ARE TRIVIAL. THE MULTIPLIER IS ONLY USED FOR A LOAD IN
*  PARALLEL WITH AN ADDF OR SUBF.
*
******************************************************************************
*
*  SECTIONS NEEDED IN LINKER COMMAND FILE: .ffttxt : fft code
*                                          .fftdat : fft data
*
******************************************************************************
*
*  THE TWIDDLE FACTORS ARE STORED IN BIT-REVERSED ORDER AND WITH A TABLE LENGTH
*  OF N/2 (N = FFTLENGTH). THE SINE TABLE IS PROVIDED IN A SEPARATE FILE
*  WITH GLOBAL LABEL _SINE POINTING TO THE BEGINNING OF THE TABLE.
*


* EXAMPLE: SHOWN FOR N=32, WN(n) = COS(2*PI*n/N) - j*SIN(2*PI*n/N)
*
*       ADDRESS   COEFFICIENT
*       0         R{WN(0)}  = COS(2*PI*0/32) = 1
*       1         -I{WN(0)} = SIN(2*PI*0/32) = 0
*       2         R{WN(4)}  = COS(2*PI*4/32) = 0.707
*       3         -I{WN(4)} = SIN(2*PI*4/32) = 0.707
*       .
*       .
*       12        R{WN(3)}  = COS(2*PI*3/32) = 0.831
*       13        -I{WN(3)} = SIN(2*PI*3/32) = 0.556
*       14        R{WN(7)}  = COS(2*PI*7/32) = 0.195
*       15        -I{WN(7)} = SIN(2*PI*7/32) = 0.981
*
* WHEN GENERATED FOR AN FFT LENGTH OF 1024, THE TABLE IS AVAILABLE FOR ALL
* FFT LENGTHS LESS THAN OR EQUAL TO 1024.
* THE MISSING TWIDDLE FACTORS (WN(),WN(),....) ARE GENERATED BY USING
* THE SYMMETRY WN(N/4+n) = -j*WN(n). THIS CAN BE REALIZED VERY EASILY BY
* EXCHANGING THE REAL AND IMAGINARY PARTS OF THE TWIDDLE FACTORS AND BY
* NEGATING THE NEW REAL PART.
*
*       AR + j AI ------------------------- AR' + j AI'
*                       \     /
*                        \   /
*                         \ /
*                         / \
*                        /   \
*                       /     \
*       BR + j BI ---- ( COS - j SIN ) ---- BR' + j BI'
*
*       TR  = BR * COS + BI * SIN
*       TI  = BI * COS - BR * SIN
*       AR' = AR + TR
*       AI' = AI + TI
*       BR' = AR - TR
*       BI' = AI - TI
*
******************************************************************************
        .global _cr2dit                 ; Entry execution point.
        .global _SINE                   ; Sine table pointer
        .global STARTB,ENDB             ; starting/ending point for given
                                        ; benchmarks
        .sect   ".fftdat"
fg      .space  1                       ; is FFT_SIZE
fg2     .space  1                       ; is FFT_SIZE/2
fg4m2   .space  1                       ; is FFT_SIZE/4 - 2
fg8m2   .space  1                       ; is FFT_SIZE/8 - 2
sintab  .word   _SINE                   ; pointer to sine table
sintp2  .word   _SINE+2                 ; pointer to sine table +2
inputp2 .space  1                       ; pointer to input +2
inputp  .space  1                       ; pointer to source address
outputp .space  1                       ; pointer to dst address
;
; Initialize C Function.
;
        .sect   ".ffttxt"
_cr2dit: LDI    SP,AR0
        PUSH    R4
        PUSH    R5
        PUSH    R6
        PUSHF   R6
        PUSH    R7
        PUSHF   R7
        PUSH    AR3
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    AR7
        PUSH    DP
        .if     .REGPARM == 0           ; arguments passed in stack
        LDI     *-AR0(1),AR2            ; src address
        LDI     *-AR0(2),R2             ; FFT size
        LDI     *-AR0(3),R3             ; dst address
        .endif
        LDP     fg                      ; Initialize DP pointer.
        STI     R2,@fg                  ; fg = FFT_SIZE
        LSH     -1,R2                   ; R2 = FFT_SIZE/2
        STI     AR2,@inputp             ; inputp = SOURCE_ADDR
        ADDI    2,AR2,R0
        STI     R0,@inputp2             ; inputp2 = SOURCE_ADDR + 2
        STI     R3,@outputp             ; outputp = DST_ADDR
        STI     R2,@fg2                 ; fg2 = nhalb = (FFT_SIZE/2)
        LSH     -1,R2
        SUBI    2,R2,R0
        STI     R0,@fg4m2               ; fg4m2 = NVIERT-2 : (FFT_SIZE/4)-2
        LSH     -1,R2
        SUBI    2,R2,R0
        STI     R0,@fg8m2               ; fg8m2 = (FFT_SIZE/8)-2
*       ar0 : AR + AI
*       ar1 : BR + BI
*       ar2 : CR + CI + CR' + CI'
*       ar3 : DR + DI
*       ar4 : AR' + AI'
*       ar5 : BR' + BI'
*       ar6 : DR' + DI'
*       ar7 : first twiddle factor = 1
STARTB:
        ldi     @fg2,ir0                ; ir0 = n/2 = offset between SOURCE_ADDRs
        ldi     @sintab,ar7             ; ar7 points to twiddle factor 1
        ldi     ar2,ar0                 ; ar0 points to AR
        addi    ir0,ar0,ar1             ; ar1 points to BR
        addi    ir0,ar1,ar2             ; ar2 points to CR
        addi    ir0,ar2,ar3             ; ar3 points to DR
        ldi     ar0,ar4                 ; ar4 points to AR'
        ldi     ar1,ar5                 ; ar5 points to BR'
        ldi     ar3,ar6                 ; ar6 points to DR'
        ldi     2,ir1                   ; address offset
        lsh     -1,ir0                  ; ir0 = n/4 = number of R4-butterflies
        subi    2,ir0,rc
******************************************************************************
* ------------ FIRST 2 STAGES AS RADIX-4 BUTTERFLY                           *
******************************************************************************
* fill pipeline
        addf    *ar2,*ar0,r4            ; r4 = AR + CR
        subf    *ar2,*ar0++,r5          ; r5 = AR - CR
        addf    *ar1,*ar3,r6            ; r6 = DR + BR
        subf    *ar1++,*ar3++,r7        ; r7 = DR - BR
        addf    r6,r4,r0                ; AR' = r0 = r4 + r6
        mpyf    *ar7,*ar3++,r1          ; r1 = DI , BR' = r3 = r4 - r6
||      subf    r6,r4,r3
        addf    r1,*ar1,r0              ; r0 = BI + DI , AR' = r0
||      stf     r0,*ar4++
        subf    r1,*ar1++,r1            ; r1 = BI - DI , BR' = r3
||      stf     r3,*ar5++
        addf    r1,r5,r2                ; CR' = r2 = r5 + r1
        mpyf    *ar7,*+ar2,r1           ; r1 = CI , DR' = r3 = r5 - r1
||      subf    r1,r5,r3
        rptbd   blk1                    ; Setup for radix-4 butterfly loop
        addf    r1,*ar0,r2              ; r2 = AI + CI , CR' = r2
||      stf     r2,*ar2++(ir1)
        subf    r1,*ar0++,r6            ; r6 = AI - CI , DR' = r3
||      stf     r3,*ar6++
        addf    r0,r2,r4                ; AI' = r4 = r2 + r0
* radix-4 butterfly loop
*
        mpyf    *ar7,*ar2--,r0          ; r0 = CR , (BI' = r2 = r2 - r0)
||      subf    r0,r2,r2
        mpyf    *ar7,*ar1++,r1          ; r1 = BR , (CI' = r3 = r6 + r7)
||      addf    r7,r6,r3
        addf    r0,*ar0,r4              ; r4 = AR + CR , (AI' = r4)
||      stf     r4,*ar4++
        subf    r0,*ar0++,r5            ; r5 = AR - CR , (BI' = r2)
||      stf     r2,*ar5++
        subf    r7,r6,r7                ; (DI' = r7 = r6 - r7)
        addf    r1,*ar3,r6              ; r6 = DR + BR , (DI' = r7)
||      stf     r7,*ar6++
        subf    r1,*ar3++,r7            ; r7 = DR - BR , (CI' = r3)
||      stf     r3,*ar2++
        addf    r6,r4,r0                ; AR' = r0 = r4 + r6
        mpyf    *ar7,*ar3++,r1          ; r1 = DI , BR' = r3 = r4 - r6
||      subf    r6,r4,r3
        addf    r1,*ar1,r0              ; r0 = BI + DI , AR' = r0
||      stf     r0,*ar4++
        subf    r1,*ar1++,r1            ; r1 = BI - DI , BR' = r3
||      stf     r3,*ar5++
        addf    r1,r5,r2                ; CR' = r2 = r5 + r1
        mpyf    *ar7,*+ar2,r1           ; r1 = CI , DR' = r3 = r5 - r1
||      subf    r1,r5,r3
        addf    r1,*ar0,r2              ; r2 = AI + CI , CR' = r2
||      stf     r2,*ar2++(ir1)
        subf    r1,*ar0++,r6            ; r6 = AI - CI , DR' = r3
||      stf     r3,*ar6++
blk1    addf    r0,r2,r4                ; AI' = r4 = r2 + r0
* clear pipeline
*
        subf    r0,r2,r2                ; BI' = r2 = r2 - r0
        addf    r7,r6,r3                ; CI' = r3 = r6 + r7
        stf     r4,*ar4                 ; AI' = r4 , BI' = r2
||      stf     r2,*ar5
        subf    r7,r6,r7                ; DI' = r7 = r6 - r7
        stf     r7,*ar6                 ; DI' = r7 , CI' = r3
||      stf     r3,*--ar2
******************************************************************************
* ------------ THIRD TO LAST-2 STAGE                                         *
******************************************************************************
        ldi     @fg2,ir1
        subi    1,ir0,ar5
        ldi     1,ar6
        ldi     @sintab,ar7             ; pointer to twiddle factor
        ldi     0,ar4                   ; group counter
        ldi     @inputp,ar0
stufe   ldi     ar0,ar2                 ; upper real butterfly output
        addi    ir0,ar0,ar3             ; lower real butterfly output
        ldi     ar3,ar1                 ; lower real butterfly input
        lsh     1,ar6                   ; double group count
        lsh     -2,ar5                  ; half butterfly count
        lsh     1,ar5                   ; clear LSB
        lsh     -1,ir0                  ; half step from upper to lower real part
        lsh     -1,ir1
        addi    1,ir1                   ; step from old imaginary to new
                                        ; real value
        ldf     *ar1++,r6               ; dummy load, only for address update
        ldf     *ar7,r7                 ; r7 = COS
* fill pipeline
*       ar0 = upper real butterfly input
*       ar1 = lower real butterfly input
*       ar2 = upper real butterfly output
*       ar3 = lower real butterfly output
*       the imaginary part has to follow
gruppe  ldf     *++ar7,r6               ; r6 = SIN
        mpyf    *ar1--,r6,r1            ; r1 = BI * SIN
        addf    *++ar4,r0,r3            ; dummy addf for counter update
        mpyf    *ar1,r7,r0              ; r0 = BR * COS
        ldi     ar5,rc
        rptbd   bfly1                   ; Setup for loop bfly1
        mpyf    *ar1++,r6,r0            ; r3 = TR = r0 + r1 , r0 = BR * SIN
||      addf    r0,r1,r3
        mpyf    *ar1++,r7,r1            ; r1 = BI * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
*
* FIRST BUTTERFLY-TYPE:
*
*       TR  = BR * COS + BI * SIN
*       TI  = BR * SIN - BI * COS
*       AR' = AR + TR
*       AI' = AI - TI
*       BR' = AR - TR
*       BI' = AI + TI
*
* loop bfly1
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r5)
||      stf     r5,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r7,r1            ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++
bfly1   addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
* switch over to next group
        subf    r1,r0,r2                ; r2 = TI = r0 - r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r5
||      stf     r5,*ar2++
        subf    r2,*ar0++(ir1),r4       ; r4 = AI - TI , BI' = r3
||      stf     r3,*ar3++(ir1)
        nop     *ar1++(ir1)             ; address update
        mpyf    *ar1--,r7,r1            ; r1 = BI * COS , AI' = r4
||      stf     r4,*ar2++(ir1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN
        ldi     ar5,rc
        rptbd   bfly2                   ; Setup for loop bfly2
        mpyf    *ar7++,*ar1++,r0        ; r3 = TR = r1 - r0 , r0 = BR * COS
||      subf    r0,r1,r3
        mpyf    *ar1++,r6,r1            ; r1 = BI * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
*
* SECOND BUTTERFLY-TYPE:
*
*       TR  = BI * COS - BR * SIN
*       TI  = BI * SIN + BR * COS
*       AR' = AR + TR
*       AI' = AI - TI
*       BR' = AR - TR
*       BI' = AI + TI
*
* loop bfly2
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r5)
||      stf     r5,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        subf    r0,r5,r3                ; TR = r3 = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r6,r1            ; r1 = BI * SIN , (AI' = r4)
||      stf     r4,*ar2++
bfly2   addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++


* clear pipeline 


KKK KKK KKK KKK KK KKK KKK KKK KKK KKK KKK KK KKK KKK KEK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KK KKK 


@inputp, ar0 


* 


fill pipeline 


ale 


2.6 


addf 
addf 
stf 
cmpi 
bned 
subf 
stf 
ldf 
stf 
nop 


ldi 


ldi 
ldi 


butterfly: 
add 
sub 
add 
sub 
butterf 


Fr Fh FH FH 


bh 
K 


Fy FH FH 


nD 
ts 
Hh O- 

Fh 


nn 
ct oct 
Fh Fh 


rl,£0;, x2 


E27. *an0,.e 


r5,*ar2++ 


ar6,ar4 
gruppe 


r2,*ar0t++(irl),r4 
r3, *ar3++ (irl) 


etter], 47 


v4, *ar2++ (irl) 
*arlt+ (irl) 
* end of this butterflygroup 


4,ir0 
stufe 


0,ar4 


ar0,ar2 
arl,ar3 


5,ixr0 


w*0 
*ar0,*arl 


@sintab,ar7 


@inputp, ard 
KKEKKK KKK KK KKK KKK KK KKK KKK KEK KKK KKK KEK KKK KKK KKK KKK KKK KK KKK KKK KKKKKKKAKKAK KKK KKK KK KK 


ECOND LAST STAGE 


’ 
, 


’ 


’ 


, 


’ 


, 


r2 = 
r3 = AI +T 
AR’ = x5 


do following 3 instructions 
BI’ 


r4 = AI -T 
r7 = COS 
AI’ = r4 


branch here 


jump out after ld(n)-3 stage 


pointer to twiddle factor 


group count 


I 


I 


er 


TI = r0 + rl 


ir0,ar0,arl 
@sintp2,ar7 


@fg8m2,rce 


ly ee 


*arlt+t+, *ar0++, 13 


*ar0,*arl 


lL ,e0 


*arlt+, *ar0t++,r1 


w*0 


*ar0,*arl 


l,r6 


*arlt+, *ar0t++,r7 


*ar0,*arl 
*earltt+(ir 
r2,*ar2+4 
r3,*ar3+4 
r0O, *ar2+4 
rl, *ar3+4 
r6, *ar2+4 
r7,*ar3+4 
r4, *ar2+4 
r5,*ar3+4 


L,r4 
0), *ar0++(ir0),4r5 


upper output 


lower input 


lower output 


pointer to twiddle faktor 
distance between two groups 


AR’ = r2 = 
BR’ = r3 = 
AI’ = r0 = 
BI’ = rl = 
AR’ = r6 = 
BR! = r7 = 
AI’ = r4 
BI’ = r5 = 
(AR! = r2) 
(BR’ = r3) 
(AI’ = r0) 
(BI’ = rl) 
AR’ = r6 
BR’ = r7 
AI’ = r4 
BI’ = r5 


AR 
AR 
Al 
Al 


AR 
AR 


=) Ar 


Al 


BR 
BR 
BI 
BI 


BR 
BR 
BI 
BI 
* 3. butterfly: w^M/4
        addf    *ar0++,*+ar1,r5         ; AR' = r5 = AR + BI
        subf    *ar1,*ar0,r4            ; AI' = r4 = AI - BR
        addf    *ar1++,*ar0--,r6        ; BI' = r6 = AI + BR
        subf    *ar1++,*ar0++,r7        ; BR' = r7 = AR - BI

* 4. butterfly: w^M/4
        addf    *+ar1,*++ar0,r3         ; AR' = r3 = AR + BI
        ldf     *-ar1,r1                ; r1 = 0 (for inner loop)
||      ldf     *ar1++,r0               ; r0 = BR (for inner loop)
        rptbd   bf2end                  ; Setup for loop bf2end
        subf    *ar1++(ir0),*ar0++,r2   ; BR' = r2 = AR - BI
        stf     r5,*ar2++               ; (AR' = r5)
||      stf     r7,*ar3++               ; (BR' = r7)
        stf     r6,*ar3++               ; (BI' = r6)

* 5. to M. butterfly:
*
* loop bf2end
        ldf     *ar7++,r7               ; r7 = COS , ((AI' = r4))
||      stf     r4,*ar2++
        ldf     *ar7++,r6               ; r6 = SIN , (BR' = r2)
||      stf     r2,*ar3++
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r3)
||      stf     r3,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r7,r1            ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r5)
||      stf     r5,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++(ir0),r7,r1       ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++
        addf    *ar0++,r3,r3            ; r3 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r3)
||      stf     r3,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        subf    r0,r5,r3                ; r3 = TR = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++,r6,r1            ; r1 = BI * SIN , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    *ar0++,r3,r5            ; r5 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r5)
||      stf     r5,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++,r4            ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++
        subf    r0,r5,r3                ; r3 = TR = r5 - r0
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
bf2end  mpyf    *ar1++(ir0),r6,r1       ; r1 = BI * SIN , r3 = AR + TR
||      addf    *ar0++,r3,r3
* clear pipeline
        stf     r2,*ar3++               ; BR' = r2 , AI' = r4
||      stf     r4,*ar2++
        addf    r1,r0,r2                ; r2 = TI = r0 + r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r3
||      stf     r3,*ar2++
        subf    r2,*ar0,r4              ; r4 = AI - TI , BI' = r3
||      stf     r3,*ar3
        stf     r4,*ar2                 ; AI' = r4

******************************************************************************
* ------------- LAST STAGE                                                   *
******************************************************************************
        ldi     @inputp,ar0
        ldi     ar0,ar2                 ; upper output
        ldi     @inputp2,ar1
        ldi     ar1,ar3                 ; lower output
        ldi     @sintp2,ar7             ; pointer to twiddle factors
        ldi     3,ir0                   ; group offset
        ldi     @fg4m2,rc
* fill pipeline
*
* 1. butterfly: w^0
        addf    *ar0,*ar1,r6            ; AR' = r6 = AR + BR
        subf    *ar1++,*ar0++,r7        ; BR' = r7 = AR - BR
        addf    *ar0,*ar1,r4            ; AI' = r4 = AI + BI
        subf    *ar1++(ir0),*ar0++(ir0),r5 ; BI' = r5 = AI - BI
*
* 2. butterfly: w^M/4
        addf    *+ar1,*ar0,r3           ; AR' = r3 = AR + BI
        ldf     *-ar1,r1                ; r1 = 0 (for inner loop)
||      ldf     *ar1++,r0               ; r0 = BR (for inner loop)
        rptbd   bflend                  ; Setup for loop bflend
        subf    *ar1++(ir0),*ar0++,r2   ; BR' = r2 = AR - BI
        stf     r6,*ar2++               ; (AR' = r6)
||      stf     r7,*ar3++               ; (BR' = r7)
        stf     r5,*ar3++(ir0)          ; (BI' = r5)
*
* 3. to M. butterfly:
*
* loop bflend
        ldf     *ar7++,r7               ; r7 = COS , ((AI' = r4))
||      stf     r4,*ar2++(ir0)
        ldf     *ar7++,r6               ; r6 = SIN , (BR' = r2)
||      stf     r2,*ar3++
        mpyf    *+ar1,r6,r5             ; r5 = BI * SIN , (AR' = r3)
||      stf     r3,*ar2++
        addf    r1,r0,r2                ; (r2 = TI = r0 + r1)
        mpyf    *ar1,r7,r0              ; r0 = BR * COS , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        addf    r0,r5,r3                ; r3 = TR = r0 + r5
        mpyf    *ar1++,r6,r0            ; r0 = BR * SIN , r2 = AR - TR
||      subf    r3,*ar0,r2
        mpyf    *ar1++(ir0),r7,r1       ; r1 = BI * COS , (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    *ar0++,r3,r3            ; r3 = AR + TR , BR' = r2
||      stf     r2,*ar3++
        mpyf    *+ar1,r7,r5             ; r5 = BI * COS , (AR' = r3)
||      stf     r3,*ar2++
        subf    r1,r0,r2                ; (r2 = TI = r0 - r1)
        mpyf    *ar1,r6,r0              ; r0 = BR * SIN , (r3 = AI + TI)
||      addf    r2,*ar0,r3
        subf    r2,*ar0++(ir0),r4       ; (r4 = AI - TI , BI' = r3)
||      stf     r3,*ar3++(ir0)
        subf    r0,r5,r3                ; r3 = TR = r0 - r5
        mpyf    *ar1++,r7,r0            ; r0 = BR * COS , r2 = AR - TR
||      subf    r3,*ar0,r2
bflend  mpyf    *ar1++(ir0),r6,r1       ; r1 = BI * SIN , r3 = AR + TR
||      addf    *ar0++,r3,r3
* clear pipeline
*
        stf     r2,*ar3++               ; (AI' = r4)
||      stf     r4,*ar2++(ir0)
        addf    r1,r0,r2                ; r2 = TI = r0 + r1
        addf    r2,*ar0,r3              ; r3 = AI + TI , AR' = r3
||      stf     r3,*ar2++
        subf    r2,*ar0,r4              ; BI' = r3
||      stf     r3,*ar3
        stf     r4,*ar2                 ; AI' = r4


******************************************************************************
* ------------- END OF FFT                                                   *
******************************************************************************
ENDB:
******************************************************************************
* ------------- BITREVERSAL                                                  *
* This bit-reversal section assumes input and output in Re-Im-Re-Im format   *
******************************************************************************


, 


@inputp, ar0 
@outputp, ard 
INPLACE 
@outputp, arl 


*kar0t++(ir0)b,r1 
r0, *tarl1 (1) 
*+tar0(1),r0 

rl, *arlt++(irl) 


*ar0++(ir0)b,r1 
r0, *+arl (1) 


ri, *arl 


; Return to C environment. 


, 

INPLACE 
rptbd BITRV2 
nop *++arl1 (2) 
nop *ar0++(ir0)b 
nop 


; ar1l=DSR_ADDR 
; irO=FFT_SIZE 
7; CC=FFT_SIZE- 


jirl=2 
;read first Im value 


| 
CONT 
BITRV2 


end: 


cmpi 
bgea 


nop 


; Return to C 


POP 
POP 
POP 
POP 
POP 
POP 
POPF 
POP 
POPF 
POP 
POP 
POP 
RETS 
.end 


arl,ar0O 
1s CONT 
Example 6-16. Bit-Reversed Sine Table


******************************************************************
*
* SINTAB.ASM : Bit-reversed sine table for a 64-point FFT
*              File to be linked with the source code for a
*              64-point radix-2 DIT FFT
*              Sine table length = FFT size / 2
*
******************************************************************

        .global _SINE
        .sect   ".sintab"
_SINE
        .float  1.000000
        .float  0.000000
        .float  0.707107
        .float  0.707107
        .float  0.923880
        .float  0.382683
        .float  0.382683
        .float  0.923880
        .float  0.980785
        .float  0.195090
        .float  0.555570
        .float  0.831470
        .float  0.831470
        .float  0.555570
        .float  0.195090
        .float  0.980785
        .float  0.995185
        .float  0.098017
        .float  0.634393
        .float  0.773010
        .float  0.881921
        .float  0.471397
        .float  0.290285
        .float  0.956940
        .float  0.956940
        .float  0.290285
        .float  0.471397
        .float  0.881921
        .float  0.773010
        .float  0.634393
        .float  0.098017
        .float  0.995185
        .end


6.5.4 Real Radix-2 FFT 


Most often, the data to be transformed is a sequence of real numbers. In this 
case, the FFT demonstrates certain symmetries that permit the reduction of 
the computational load even further. Example 6-17 and Example 6-18 show 
the generic implementation of a real-valued radix-2 FFT (forward and inverse). 
For such an FFT, the total storage required for a length-N transform is only N 
locations; in a complex FFT, 2N are necessary. Recovery of the rest of the 
points is based on the symmetry conditions. A companion table 
(Example 6-13) should be used to provide the twiddle factors. 
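
The recovery of the remaining points can be sketched on a host machine as follows. This is an illustration under stated assumptions, not part of the original program: it assumes the packed output order R(0), R(1), ..., R(N/2), I(N/2-1), ..., I(1) documented in the header of Example 6-17 and the usual forward-DFT sign convention, and the names `dft`, `pack_reference`, and `unpack_real_fft` are hypothetical helpers.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def pack_reference(x):
    """Pack the DFT of real x as R(0..N/2) followed by I(N/2-1)..I(1)."""
    n = len(x)
    spec = dft(x)
    packed = [spec[k].real for k in range(n // 2 + 1)]
    packed += [spec[k].imag for k in range(n // 2 - 1, 0, -1)]
    return packed


def unpack_real_fft(packed):
    """Rebuild all N complex points from the N packed real words."""
    n = len(packed)
    half = n // 2
    spectrum = [0j] * n
    spectrum[0] = complex(packed[0], 0.0)        # I(0) is zero for real input
    spectrum[half] = complex(packed[half], 0.0)  # I(N/2) is zero as well
    for k in range(1, half):
        value = complex(packed[k], packed[n - k])  # R(k) + j*I(k)
        spectrum[k] = value
        spectrum[n - k] = value.conjugate()      # symmetry X(N-k) = conj(X(k))
    return spectrum
```

Only N words are stored because the upper half of the spectrum is the complex conjugate of the lower half, and the imaginary parts at k = 0 and k = N/2 are zero for real input.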


Example 6-17. Real Forward Radix-2 FFT 


******************************************************************************
*
* FILENAME    : FFFT_RL.ASM
* DESCRIPTION : REAL, RADIX-2 DIF FFT FOR TMS320C40
* DATE        : 1/19/93
* VERSION     : 3.0
*
******************************************************************************
*
* VERSION   DATE      COMMENTS
* 1.0       7/18/91   ALEX TESSAROLO(TI Australia):
*                     Original Release (C30 version)
* 2.0       7/23/92   ALEX TESSAROLO(TI Australia):
*                     Most Stages Modified (C30 version).
*                     Minimum FFT Size increased from 32 to 64.
*                     Faster in place bit reversing algorithm.
*                     Program size increased by about 100 words.
*                     One extra data word required.
* 3.0       1/19/93   ROSEMARIE PIEDRA(TI Houston):
*                     C40 porting started from C30 forward real FFT
*                     version 2.0. Expanded calling conventions to the use
*                     of registers for parameter passing.
*
******************************************************************************
*
* SYNOPSIS:
*
* int ffft_rl(FFT_SIZE, LOG_SIZE, SOURCE_ADDR, DEST_ADDR, SINE_TABLE, BIT_REVERSE)
*             ar2       r2        r3           rc         rs          re
*
*       int   FFT_SIZE     ; 64, 128, 256, 512, 1024, ...
*       int   LOG_SIZE     ; 6, 7, 8, 9, 10, ...
*       float *SOURCE_ADDR ; Points to location of source data.
*       float *DEST_ADDR   ; Points to where data will be
*                          ; operated on and stored.
*       float *SINE_TABLE  ; Points to the SIN/COS table.
*       int   BIT_REVERSE  ; = 0, Bit Reversing is disabled.
*                          ; <> 0, Bit Reversing is enabled.
*
* NOTE: 1) If SOURCE_ADDR = DEST_ADDR, then in place bit reversing
*          is performed, if enabled (more processor intensive).
*       2) FFT_SIZE must be >= 64 (this is not checked).
*
******************************************************************************
*
* DESCRIPTION:
*
* Generic function to do a radix-2 FFT computation on the C40.
* The input data array is FFT_SIZE-long with only real data. The output is
* stored in the same locations (in-place) with real and imaginary
* points R and I as follows:
*
*       DEST_ADDR[0]            -> R(0)
*                                  R(1)
*                                  R(2)
*                                  R(3)
*                                  .
*                                  R(FFT_SIZE/2)
*                                  I(FFT_SIZE/2 - 1)
*                                  .
*                                  I(2)
*       DEST_ADDR[FFT_SIZE - 1] -> I(1)
*
* The program is based on the FORTRAN program in the paper by Sorensen et al.,
* June 1987 issue of Trans. on ASSP.
*
* Bit reversal is optionally implemented at the beginning of the function.
*
* The sine/cosine table for the twiddle factors is expected to be supplied in
* the following format:
*
*       SINE_TABLE[0]            -> sin(0*2*pi/FFT_SIZE)
*                                   sin(1*2*pi/FFT_SIZE)
*                                   .
*                                   sin((FFT_SIZE/2-2)*2*pi/FFT_SIZE)
*       SINE_TABLE[FFT_SIZE/2-1] -> sin((FFT_SIZE/2-1)*2*pi/FFT_SIZE)
*
* NOTE: The table is the first half period of a sine wave.
*
* NOTES: 1. Calling C program can be compiled with large or small model. Both
*           calling convention methods (stack or register parameter
*           passing) are supported.
*
*        2. Sections needed in linker command file: .ffttxt : fft code
*                                                   .fftdat : fft data
*
*        3. The DEST_ADDR must be aligned such that the first LOG_SIZE bits
*           are zero (this is not checked by the program)
*
* Caution: DP initialized only once in the program. Be wary with interrupt
*          service routines. Make sure interrupt service routines save the DP
*          pointer.
*
******************************************************************************
*
* REGISTERS USED: R0, R1, R2, R3, R4, R5, R6, R7
*                 AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7
*                 IR0, IR1
*                 RC, RS, RE
*                 DP
*
* MEMORY REQUIREMENTS: Program = 405 Words (approximately)
*                      Data    = 7 Words
*                      Stack   = 12 Words
*
******************************************************************************
*
* BENCHMARKS: Assumptions - Program in RAM0
*                         - Reserved data in RAM0
*                         - Stack on Local/Global Bus RAM
*                         - Sine/Cosine tables in RAM0
*                         - Processing and data destination in RAM1.
*                         - Local/Global Bus RAM, 0 wait state.
*
*       FFT Size   Bit Reversing   Data Source   Cycles (C40)
*       1024       OFF             RAM1          19404 approx.
*
* Note: This number does not include the C callable overheads.
*       This benchmark is the number of cycles between labels STARTB and ENDB.
*
* NOTE: - If .ffttxt is located off-chip, enable cache for faster performance
*
******************************************************************************
*
FP      .set    AR3
        .global _ffft_rl                ; Entry execution point.
        .global STARTB,ENDB
FFT_SIZE:    .usect ".fftdat",1         ; Reserve memory for arguments.
LOG_SIZE:    .usect ".fftdat",1
SOURCE_ADDR: .usect ".fftdat",1
DEST_ADDR:   .usect ".fftdat",1
SINE_TABLE:  .usect ".fftdat",1
BIT_REVERSE: .usect ".fftdat",1
SEPARATION:  .usect ".fftdat",1
*
* Initialize C Function
*
        .sect   ".ffttxt"
_ffft_rl: PUSH  FP                      ; Preserve C environment.
        LDI     SP,FP
        PUSH    R4
        PUSH    R5
        PUSH    R6
        PUSHF   R6
        PUSH    R7
        PUSHF   R7
        PUSH    AR4
        PUSH    AR5
        PUSH    AR6
        PUSH    AR7
        PUSH    DP
        LDP     FFT_SIZE                ; Initialize DP pointer.
        .if     .REGPARM == 0           ; arguments passed in stack
        LDA     *-FP(2),AR2
        LDI     *-FP(3),R2
        LDI     *-FP(4),R3
        LDI     *-FP(5),RC
        LDI     *-FP(6),RS
        LDI     *-FP(7),RE
        .endif
        STI     AR2,@FFT_SIZE
        STI     R2,@LOG_SIZE
        STI     R3,@SOURCE_ADDR
        STI     RC,@DEST_ADDR
        STI     RS,@SINE_TABLE
        STI     RE,@BIT_REVERSE
;
; Check Bit Reversing Mode (on or off).
;
; BIT_REVERSING = 0, then OFF (no bit reversing).
; BIT_REVERSING <> 0, Then ON.
;
        LDI     @BIT_REVERSE,R0
        BZ      MOVE_DATA
;
; Check Bit Reversing Type.
;
; If SourceAddr = DestAddr, Then In Place Bit Reversing.
; If SourceAddr <> DestAddr, Then Standard Bit Reversing.
;
        LDI     @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     IN_PLACE
;
; Bit reversing Type (From Source to Destination).
;
; NOTE: abs(SOURCE_ADDR - DEST_ADDR) must be > FFT_SIZE,
;       this is not checked.
;
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @FFT_SIZE,IR0
        LSH     -1,IR0                  ; IR0 = Half FFT size.
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++(IR0)B
        STF     R1,*AR1++(IR0)B
        BR      STARTB
;
; In Place Bit Reversing.
;
; Bit Reversing On Even Locations, 1st Half Only.
;
IN_PLACE: LDA   @FFT_SIZE,IR0
        LSH     -2,IR0                  ; IR0 = Quarter FFT size.
        LDA     2,IR1
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    3,RC
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        NOP     *AR1++(IR0)B
        NOP     *AR2++(IR0)B
        LDF     *++AR0(IR1),R0
        LDF     *AR1,R1
        RPTBD   BITRV1
        CMPI    AR1,AR0                 ; Xchange Locations only if AR0<AR1.
        LDFGT   R0,R1
        LDFGT   *AR1++(IR0)B,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR0
        LDF     *AR1,R1
||      STF     R1,*AR2++(IR0)B
        CMPI    AR1,AR0
        LDFGT   R0,R1
BITRV1: LDFGT   *AR1++(IR0)B,R0
        STF     R0,*AR0
        STF     R1,*AR2
;
; Perform Bit Reversing, Odd Locations, 2nd Half Only
;
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     @DEST_ADDR,AR0
        ADDI    RC,AR0
        ADDI    1,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        LSH     -1,RC
        SUBI    3,RC
        NOP     *AR1++(IR0)B
        NOP     *AR2++(IR0)B
        LDF     *++AR0(IR1),R0
        LDF     *AR1,R1
        RPTBD   BITRV2
        CMPI    AR1,AR0                 ; Xchange Locations only if AR0<AR1
        LDFGT   R0,R1
        LDFGT   *AR1++(IR0)B,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR0
        LDF     *AR1,R1
||      STF     R1,*AR2++(IR0)B
        CMPI    AR1,AR0
        LDFGT   R0,R1
BITRV2: LDFGT   *AR1++(IR0)B,R0
        STF     R0,*AR0
        STF     R1,*AR2
;
; Perform Bit Reversing, Odd Locations, 1st Half Only
;
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     RC,IR0
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        ADDI    1,AR0
        ADDI    IR0,AR1
        LSH     -1,RC
        LDA     RC,IR0
        SUBI    2,RC
        RPTBD   BITRV3
        NOP                             ; Note: could be instruction
        LDF     *AR0,R0
        LDF     *AR1,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR1++(IR0)B
BITRV3  LDF     *AR1,R1
||      STF     R1,*-AR0(IR1)
        STF     R0,*AR1
||      STF     R1,*AR0
        BR      STARTB
;
; Check Data Source Locations.
;
; If SourceAddr = DestAddr, Then do nothing.
; If SourceAddr <> DestAddr, Then move data.
;
MOVE_DATA: LDI  @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     STARTB
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++
        STF     R1,*AR1
;
; Perform first and second FFT loops.
;
;       AR1 -> |  I1  | 0 <- [X(I1) + X(I2)] + [X(I3) + X(I4)]
;       AR2 -> |  I2  | 1 <- [X(I1) - X(I2)]
;       AR3 -> |  I3  | 2 <- [X(I1) + X(I2)] - [X(I3) + X(I4)]
;       AR4 -> |  I4  | 3 <- -[X(I3) - X(I4)]
;       AR1 -> |      | 4
;              |
;             \|/
;
STARTB: LDA     @DEST_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        LDA     AR1,AR4
        ADDI    1,AR2
        ADDI    2,AR3
        ADDI    3,AR4
        LDA     4,IR0
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    2,RC
        LDF     *AR2,R0                 ; R0 = X(I2)
||      LDF     *AR3,R1                 ; R1 = X(I3)
        ADDF3   R1,*AR4,R4              ; R4 = X(I3) + X(I4)
        SUBF3   R1,*AR4++(IR0),R5       ; R5 = -[X(I3) - X(I4)]
        SUBF3   R0,*AR1,R6              ; R6 = X(I1) - X(I2)
        RPTBD   LOOP1_2
        ADDF3   R0,*AR1++(IR0),R7       ; R7 = X(I1) + X(I2)
        ADDF3   R7,R4,R2                ; R2 = R7 + R4
        SUBF3   R4,R7,R3                ; R3 = R7 - R4
;
        LDF     *+AR2(IR0),R0
||      LDF     *+AR3(IR0),R1
        ADDF3   R1,*AR4,R4
||      STF     R3,*AR3++(IR0)          ; X(I3) <- R3
        SUBF3   R1,*AR4++(IR0),R5
||      STF     R5,*-AR4(IR0)           ; X(I4) <- R5
        SUBF3   R0,*AR1,R6
||      STF     R6,*AR2++(IR0)          ; X(I2) <- R6
        ADDF3   R0,*AR1++(IR0),R7
||      STF     R2,*-AR1(IR0)           ; X(I1) <- R2
        ADDF3   R7,R4,R2
LOOP1_2 SUBF3   R4,R7,R3
        STF     R3,*AR3
        STF     R5,*-AR4(IR0)
        STF     R6,*AR2
        STF     R2,*-AR1(IR0)
;
; Perform Third FFT Loop.
;
; Part A:
;
;       AR1 -> |  I1  | 0 <- X(I1) + X(I3)
;              |      | 1
;              |  I2  | 2
;              |      | 3
;       AR2 -> |  I3  | 4 <- X(I1) - X(I3)
;              |      | 5
;       AR3 -> |  I4  | 6 <- -X(I4)
;              |      | 7
;       AR1 -> |      | 8
;              |      | 9
;             \|/
;
        LDA     @DEST_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        ADDI    4,AR2
        ADDI    6,AR3
        LDA     8,IR0
        LDI     @FFT_SIZE,RC
        LSH     -3,RC
        SUBI    2,RC
        RPTBD   LOOP3_A
        SUBF3   *AR2,*AR1,R1
        ADDF3   *AR2,*AR1,R2
        NEGF    *AR3,R3
        LDF     *+AR2(IR0),R0           ; R0 = X(I3)
||      STF     R2,*AR1++(IR0)
        SUBF3   R0,*AR1,R1              ; R1 = X(I1) - X(I3)
||      STF     R1,*AR2++(IR0)
        ADDF3   R0,*AR1,R2              ; R2 = X(I1) + X(I3)
||      STF     R3,*AR3++(IR0)
LOOP3_A: NEGF   *AR3,R3                 ; R3 = -X(I4)
        STF     R2,*AR1                 ; X(I1) <- R2
        STF     R1,*AR2                 ; X(I3) <- R1
        STF     R3,*AR3                 ; X(I4) <- R3
;
; Part B:
;
;              |      | 0
;       AR0 -> |  I1  | 1 <- X(I1) + [X(I3)*COS + X(I4)*COS]
;              |      | 2
;       AR1 -> |  I2  | 3 <- X(I1) - [X(I3)*COS + X(I4)*COS]
;              |      | 4
;       AR2 -> |  I3  | 5 <- -X(I2) - [X(I3)*COS - X(I4)*COS]
;              |      | 6
;       AR3 -> |  I4  | 7 <- X(I2) - [X(I3)*COS - X(I4)*COS]
;              |      | 8
;       AR0 -> |      | 9       NOTE: COS(2*pi/8) = SIN(2*pi/8)
;             \|/
;
        LDI     @FFT_SIZE,RC
        LSH     -3,RC
        LDA     RC,IR1
        SUBI    3,RC
        LDA     8,IR0
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        LDA     AR0,AR2
        LDA     AR0,AR3
        ADDI    1,AR0
        ADDI    3,AR1
        ADDI    5,AR2
        ADDI    7,AR3
        LDA     @SINE_TABLE,AR7         ; Initialize table pointers.
        LDF     *++AR7(IR1),R7          ; R7 = COS(2*pi/8)
                                        ; *AR7 = COS(2*pi/8)
        MPYF3   *AR7,*AR2,R0            ; R0 = X(I3)*COS
        MPYF3   *AR3,R7,R1              ; R1 = X(I4)*COS
        ADDF3   R0,R1,R2                ; R2 = [X(I3)*COS + X(I4)*COS]
        MPYF3   *AR7,*+AR2(IR0),R0
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ST ee. eT 


UBF 3 
UBF3 


nn 


3 


SI 


rj 


ANP NNNPNNPNHZTENPHNNNHNPHNUH ZTPHnNEZENPHNHNPDW 


RO,R1 
*ARI, 


LOOP 3 


,R3 
R3,R4 


B 


*AR1,R3,R4 


R4,*A 
R2,*A 
R4,*A 
*ARO, 
R4,*A 
*AR3, 
R4,*A 
RO,R1 
*ART, 
RO,R1 
*ARI, 
*ARI, 
R4,*A 
R2,*A 
R4,*A 
*BRO, 
R4,*A 
*AR3, 
R4,*A 
RO,R1 
RO,R1 
*ARI, 
*ARI, 
R4,*A 
R2,*A 
R4,*A 
*ARO, 
R4,*A 
R4,*A 


Perform Fourth FFT Loop. 


Part A: 


AR1-> 


AR2-> 


AR3-> 


|__14__| 


R2++(IRO) 
RO,R4 
R3++ (IRO) 
R2,R4 
R1++(IRO) 
R7,R1 
RO++ (IRO) 
,R2 

*+AR2 (IRO),RO 
,R3 

R3,R4 
R3,R4 
R2++ (IRO) 
RO,R4 


QO <= X(I1) 


©O 
A 
| 
bat 
H 
i 


12 <- -X(T4) 


, R3 = -[X(13)*COS - X(14) *COS] 
, R4 = -X(1I2) + R3 --+ 
, R4 = X(12) + R3B --|--+ 
* XCL3) -<Ssasaesseeea4 + 
; R4 = X(I1) - R2 --+ 
; X(I4) < + 
; R4 = X(I1) + R2 --|--+ 
p K(1L2). <=s-s===—=-=—<4 t 
* RXTL) < + 
+ X(I3) 
= (13) 


LOOP4_A: 


nnn 
q 
| 


ARO 


AR1 


AR2 


AR4 


AR3 


ARO 


ZNPunvnney srPnDaHer rr PPHHE 
UO 
ry 
w 


\1/ 


@DEST_ADDR, AR1 


AR1,AR2 
AR1,AR3 

8, AR2 

12,AR3 

16, IRO 
@FFT_SIZE,RC 
-4,RC 

2,RC 

LOOP4_A 
*BR2,*AR1,R1 
*BR2,*AR1,R2 
*AR3,R3 
*+AR2 (IRO),RO 


RO, *AR1,R1 


RO, *AR1, R2 


*AR3,R3 
R2,*AR1 
R1, *AR2 
R3,*AR3 
__ Ti (3£d),_ 
__T1_ (2nd) _ 
__ Fi (ist). 
i? 1st) 
__I12_(2nd)_ 
__I2. (3¥d),_ 
__13_ (3rd)_ 
__ 13. (2nd) _ 
TS st) 
I4_ (1st) 
__ 14 (2nd) 
I4_ (3rd) 
| 16 
| 
MALY 


R2, *AR1++ (IRO) 
R1, *AR2++ (IRO) 


R3, *AR3++ (IRO) 


7; RO X (13) 

;R1l = 
;R2 = 
7R3 = 
X(T) 
7;X(13) 
;X (14) 

0 

1 <- X(I1) + 

2 ‘ 

3 

4 

5 

6 ‘ 

7 <- X(I1) - 

8 

9 <= =K(12) - 

10 : 

11 

12 

13 

14 ‘i 

15 <= X20 = 

17 


X(I1) - X(13) -----4 ; 
X(I1) + X(I3) --+ 
-X (14) --+ 

2 F 

g + 
Se: . 

[X(I3) *COS + X(I4) *SIN] 
[X(I3) *COS + X(I4) *SIN] 
[xX (I3) *SIN - X(14) *COS] 
[xX (I3) *SIN - X(I4) *COS] 


E 


i 


ZSIZMNZNMPNNHNPHHZPHNER 


DI 


DON 
PP 


UBL 


GU00DU 
Dp DD 


iw) 
1s) 
H 


ri 


mj 


3 


mj 


3 


@FFT_SIZE, RC 
-4,RC 

RC, IR1 

2, IR0 

3,RC 
@DEST_ADDR, ARO 
ARO, AR1 

ARO, AR2 

ARO, AR3 

ARO, AR4 

1, ARO 

7, BRL 


@SINE_TABLE, AR7 
*++AR7 (IR1),R7 


AR7,AR6 
*++AR6 (IR1),R6 


AR6,AR5 
*++AR5 (IR1),R5 


16,IR1 

*AR7, *AR4, RO 
*++AR2 (IRO),R5,R4 
*—-AR3(IRO),R5,R1 
*AR7, *AR3, RO 
RO,R1,R2 

*AR6, *-AR4, RO 
R4,R0,R3 
*—--AR1(IRO),R3,R4 
*AR1,R3,R4 

R4, *AR2-- 
R2,*++ARO(IRO),R4 
R4, *AR3 
*BRO,R2,R4 

R4, *AR1 


*++AR3,R6,R1 
R4, *ARO 
RO,R1,R2 
*AR5,*-AR4 (IRO),RO 
RO,R1,R3 
*+4+AR1,R3,R4 
*AR1,R3,R4 
R4, *AR2 
R2,*--ARO,R4 
R4, *AR3 
*BRO,R2,R4 
R4, *AR1 
*—--AR2,R7,R4 
R4, *ARO 
*+4AR3,R7,R1 
*AR5, *AR3,RO 


. 
7) 
= 

ll 


SIN (1* [2*pi/16]) 
COS (3* [2*pi/16]) 


as 
a 
oe) 
~~ 

I 


R6 = SIN(2*[2*pi/16]) 
= COS (2*[2*pi/16]) 


‘Ne 


* 
> 
3) 
fo) 

| 


R5 = SIN(3*[2*pi/16]) 


oN 


;*ARS = COS(1*[2*pi/16]) 

;RO X (13) *COS (3) 

7R4 = X(1I3)*SIN(3) 

7R1 = X(14)*SIN(3) 

,;RO = X(14) *COS (3) 

;R2 = [X(I3)*COS + X(14)*SIN] 
7;R3 = -[X(1I3)*SIN - X(14) *COS] 
;R4 = -X(I12) + R3 --+ 

;R4 = X(I2) + RB --|-—+ 
;XUI3) <-se—e=-5+--—4 + 

;R4 = X(I1) = BR2 -=—+ 

sX(14) < + 

j;R4 = &(IL) + R2 ==-|—=—-+ 

PX (I2) <————=-—--——--—4 t | 

; | 

; | 

PX (IL) =< + 
        ADDF3   R0,R1,R2
        MPYF3   *AR7,*++AR4(IR1),R0
        SUBF3   R4,R0,R3
        SUBF3   *++AR1,R3,R4
        RPTBD   LOOP4_B
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2++(IR1)
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3++(IR1)
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1++(IR1)
        MPYF3   *++AR2(IR0),R5,R4
        STF     R4,*AR0++(IR1)
        MPYF3   *--AR3(IR0),R5,R1
        MPYF3   *AR7,*AR3,R0
        ADDF3   R0,R1,R2
        MPYF3   *AR6,*-AR4,R0
        SUBF3   R4,R0,R3
        SUBF3   *--AR1(IR0),R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2--
        SUBF3   R2,*++AR0(IR0),R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        MPYF3   *++AR3,R6,R1
        STF     R4,*AR0
        ADDF3   R0,R1,R2
        MPYF3   *AR5,*-AR4(IR0),R0
        SUBF3   R0,R1,R3
        SUBF3   *++AR1,R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3
        ADDF3   *AR0,R2,R4
        STF     R4,*AR1
        MPYF3   *--AR2,R7,R4
        STF     R4,*AR0
        MPYF3   *++AR3,R7,R1
        MPYF3   *AR5,*AR3,R0
        ADDF3   R0,R1,R2
        MPYF3   *AR7,*++AR4(IR1),R0
        SUBF3   R4,R0,R3
        SUBF3   *++AR1,R3,R4
        ADDF3   *AR1,R3,R4
        STF     R4,*AR2++(IR1)
        SUBF3   R2,*--AR0,R4
        STF     R4,*AR3++(IR1)
LOOP4_B: ADDF3  *AR0,R2,R4
        STF     R4,*AR1++(IR1)
        MPYF3   *++AR2(IR0),R5,R4
        STF     R4,*AR0++(IR1)
        MPYF3   *--AR3(IR0),R5,R1
        MPYF3   *AR7,*AR3,R0
        ADDF3   R0,R1,R2
        MPYF3   *AR6,*-AR4,R0


Example 6-17. Real Forward Radix-2 FFT (Continued)

ST re 


UBF 3 
UBF 3 
DDF3 
TE 
UBF 3 
acy 
DDF3 
TE 
PYF3 
IF 
DDF3 
PYF3 
UBF 3 
UBF 3 
DDF3 
TE 
UBF 3 
TE 
DDF3 
TE 
PYF3 
Ee 
PYF3 
PYES 
DDF3 
UBF 3 
UBF 3 
DDF3 
TE 
UBF 3 
TE 
DDF3 
TE 
TE 


3 


3 


ANP NNNHNPHNUHN PR EN ZENPUHNNNPHNVNEZPHNEZENPHNNNPHN 


4 


R4,R0,R3 
*—--AR1(IRO),R3,R4 
*AR1,R3,R4 
R4, *AR2-- 
R2,*++ARO(IRO),R4 
R4, *AR3 
*BRO,R2,R4 
R4, *AR1 
*+4+AR3,R6,R1 
R4, *ARO 
RO,R1,R2 
*AR5,*-AR4 (IRO),RO 
RO,R1,R3 
*+4+AR1,R3,R4 
*AR1,R3,R4 
R4, *AR2 
R2,*--ARO,R4 
R4, *AR3 
*BRO,R2,R4 
R4, *ARL 
*—--AR2,R7,R4 
R4, *ARO 
*+4AR3,R7,R1 
*AR5, *AR3,RO 
RO,R1,R2 
R4,R0,R3 
*+4+AR1,R3,R4 
*AR1,R3,R4 
R4, *AR2 
R2,*--ARO,R4 
R4, *AR3 
*BRO,R2,R4 
R4, *AR1 

R4, *ARO 


;
; Perform Remaining FFT loops (loop 4 onwards).
;
;                                   LOOP
;                                 1st  2nd
;                                 \/   \/
;        AR1->   X'(I1)            0    0  <- X'(I1) + X'(I3)
;                X(I1)_(1st)       1    1  <- X(I1) + [X(I3)*COS + X(I4)*SIN]
;                X(I1)_(2nd)       2    2
;                X(I1)_(3rd)       3    3
;                .
;                X'(I2)            8   16
;                .
;                X(I2)_(3rd)      13   29
;                X(I2)_(2nd)      14   30
;                X(I2)_(1st)      15   31  <- X(I1) - [X(I3)*COS + X(I4)*SIN]
;                X'(I3)           16   32  <- X'(I1) - X'(I3)
;                X(I3)_(1st)      17   33  <- -X(I2) - [X(I3)*SIN - X(I4)*COS]
;                X(I3)_(2nd)      18   34
;                X(I3)_(3rd)      19   35
;                .
;        C ->    .
;                X'(I4)           24   48  <- -X'(I4)
;        D ->    .
;                .
;                X(I4)_(3rd)      29   61
;                X(I4)_(2nd)      30   62
;        AR4->   X(I4)_(1st)      31   63  <- X(I2) - [X(I3)*SIN - X(I4)*COS]
;                                 32   64
;        AR1->                    33   65
;                                  .
;                                 \|/
;
        LDA     @FFT_SIZE,IR0
        LSH     -2,IR0
        STI     IR0,@SEPARATION
        LSH     -2,IR0
        LDI     5,R5
        LDI     3,R7
        LDI     16,R6
        LDA     @DEST_ADDR,AR5
        LDA     @DEST_ADDR,AR1
        LSH     -1,IR0
        LSH     1,R7
LOOP:   ADDI    1,R7
        LSH     1,R6
        LDA     AR1,AR4
        ADDI    R7,AR1                  ; AR1 points at A.
        LDA     AR1,AR2
        ADDI    2,AR2                   ; AR2 points at B.
        ADDI    R6,AR4
        SUBI    R7,AR4                  ; AR4 points at D.
        LDA     AR4,AR3
        SUBI    2,AR3                   ; AR3 points at C.
        LDA     @SINE_TABLE,AR0         ; AR0 points at SIN/COS table.
        LDA     R7,IR1
        LDI     R7,RC
INLOP:  ADDF3   *--AR1(IR1),*++AR2(IR1),R0  ; R0 = X'(I1) + X'(I3)
        SUBF3   *--AR3(IR1),*AR1++,R1   ; R1 = X'(I1) - X'(I3)
        NEGF    *--AR4,R2               ; R2 = -X'(I4)
||      STF     R0,*-AR1                ; X'(I1) <- R0
        STF     R1,*AR2--               ; X'(I3) <- R1
||      STF     R2,*AR4++(IR1)          ; X'(I4) <- R2
        LDA     @SEPARATION,IR1         ; IR1 = SEPARATION BETWEEN SIN/COS TABLES
        SUBI    3,RC
        MPYF3   *++AR0(IR0),*AR4,R4     ; R4 = X(I4)*SIN
        MPYF3   *AR0,*++AR3,R1          ; R1 = X(I3)*SIN
        MPYF3   *++AR0(IR1),*AR4,R0     ; R0 = X(I4)*COS
        MPYF3   *AR0,*AR3,R0            ; R0 = X(I3)*COS
||      SUBF3   R1,R0,R3                ; R3 = -[X(I3)*SIN - X(I4)*COS]
        MPYF3   *++AR0(IR0),*-AR4,R0
||      ADDF3   R0,R4,R2                ; R2 = X(I3)*COS + X(I4)*SIN
        SUBF3   *AR2,R3,R4              ; R4 = R3 - X(I2)
        RPTBD   IN_BLK
        ADDF3   *AR2,R3,R4              ; R4 = R3 + X(I2)
        STF     R4,*AR3++               ; X(I3) <- R4
        SUBF3   R2,*AR1,R4              ; R4 = X(I1) - R2
        STF     R4,*AR4--               ; X(I4) <- R4
        ADDF3   *AR1,R2,R4              ; R4 = X(I1) + R2
        STF     R4,*AR2--               ; X(I2) <- R4
        LDF     *-AR0(IR1),R3
        MPYF3   *AR4,R3,R4
        STF     R4,*AR1++               ; X(I1) <- R4
        MPYF3   *AR3,R3,R1
        MPYF3   *AR0,*AR3,R0
        SUBF3   R1,R0,R3
        MPYF3   *++AR0(IR0),*-AR4,R0
        ADDF3   R0,R4,R2
        SUBF3   *AR2,R3,R4
        ADDF3   *AR2,R3,R4
        STF     R4,*AR3++
        SUBF3   R2,*AR1,R4
        STF     R4,*AR4--
IN_BLK: ADDF3   *AR1,R2,R4
        STF     R4,*AR2--
        LDF     *-AR0(IR1),R3
        MPYF3   *AR4,R3,R4
        STF     R4,*AR1++
        MPYF3   *AR3,R3,R1
        MPYF3   *AR0,*AR3,R0
        SUBF3   R1,R0,R3
        LDA     R6,IR1
        ADDF3   R0,R4,R2
        SUBF3   *AR2,R3,R4
        ADDF3   *AR2,R3,R4
||      STF     R4,*AR3++(IR1)
        SUBF3   R2,*AR1,R4
||      STF     R4,*AR4++(IR1)
        ADDF3   *AR1,R2,R4
||      STF     R4,*AR2++(IR1)
        STF     R4,*AR1++(IR1)
        SUBI3   AR5,AR1,R0
        CMPI    @FFT_SIZE,R0
        BLTD    INLOP                   ; LOOP BACK TO THE INNER LOOP
        LDA     @SINE_TABLE,AR0         ; AR0 POINTS TO SIN/COS TABLE
        LDA     R7,IR1
        LDI     R7,RC
        ADDI    1,R5
        CMPI    @LOG_SIZE,R5
        BLED    LOOP
        LDA     @DEST_ADDR,AR1
        LSH     -1,IR0
        LSH     1,R7
;
; Return to C environment.
;
ENDB:   POP     DP                      ; Restore C environment variables.
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     FP
        RETS
        .end
*
* No more.
*
********************************************************************************
********************************************************************************
*
*  FILENAME    : IFFT_RL.ASM
*  DESCRIPTION : INVERSE FFT FOR TMS320C40
*  DATE        : 1/19/93
*  VERSION     : 2.0
*
********************************************************************************
*
*  VERSION   DATE      COMMENTS
*
*  1.0       2/18/92   DANIEL MAZZOCCO (TI Houston):
*                      Original Release (C30 version).
*                      Started from forward real FFT routine written by Alex
*                      Tessarolo, rev 2.0.
*
*  2.0       1/19/93   ROSEMARIE PIEDRA (TI Houston): C40 porting started from
*                      C30 inverse real FFT version 1.0 (C30). Expanded calling
*                      conventions to registers for parameter passing.
*
********************************************************************************
*
*  SYNOPSIS:
*
*  int ifft_rl(FFT_SIZE, LOG_SIZE, SOURCE_ADDR, DEST_ADDR, SINE_TABLE, BIT_REVERSE);
*              ar2       r2        r3           rc         rs          re
*
*      int    FFT_SIZE      ; 64, 128, 256, 512, 1024, ...
*      int    LOG_SIZE      ; 6, 7, 8, 9, 10, ...
*      float  *SOURCE_ADDR  ; Points to where data is originated
*                           ; and operated on.
*      float  *DEST_ADDR    ; Points to where data will be stored.
*      float  *SINE_TABLE   ; Points to the SIN/COS table.
*      int    BIT_REVERSE   ; = 0, Bit Reversing is disabled.
*                           ; <> 0, Bit Reversing is enabled.
*
*  NOTE: 1) If SOURCE_ADDR = DEST_ADDR, then in place bit reversing is
*           performed, if enabled (more processor intensive).
*        2) FFT_SIZE must be >= 64 (this is not checked).
*
********************************************************************************
*
*  DESCRIPTION:
*
*  Generic function to do an inverse radix-2 FFT computation on the C40.
*  The input data array is FFT_SIZE-long with real and imaginary points R and
*  I as follows:
*
*      SOURCE_ADDR[0]            ->
*                                   .
*                                   I(2)
*      SOURCE_ADDR[FFT_SIZE - 1] -> I(1)
*
*  The output data array will contain only real values. Bit reversal is
*  optionally implemented at the end of the function.
*
*  The sine/cosine table for the twiddle factors is expected to be supplied in
*  the following format:
*
*      SINE_TABLE[0]            -> sin(0*2*pi/FFT_SIZE)
*                                  sin(1*2*pi/FFT_SIZE)
*                                  .
*                                  sin((FFT_SIZE/2-2)*2*pi/FFT_SIZE)
*      SINE_TABLE[FFT_SIZE/2-1] -> sin((FFT_SIZE/2-1)*2*pi/FFT_SIZE)
*
*  NOTE: The table is the first half period of a sine wave.


*
********************************************************************************
*
*  NOTE: 1. Calling C program can be compiled using either large or small model.
*           Both calling conventions (stack or register parameter passing)
*           are supported.
*
*        2. Sections needed in linker command file: .ffttxt : fft code
*                                                   .fftdat : fft data
*
*        3. The SOURCE_ADDR must be aligned such that the first LOG_SIZE bits
*           are zero (this is not checked by the program).
*
*  CAUTION: DP initialized only once in the program. Be wary with interrupt
*           service routines. Ensure interrupt service routines save DP pointer.
*
********************************************************************************
*
*  REGISTERS USED: R0, R1, R2, R3, R4, R5, R6, R7
*                  AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7
*                  IR0, IR1
*                  RC, RS, RE
*                  DP
*
*  MEMORY REQUIREMENTS: Program = 322 Words (approximately)
*                       Data    = 7 Words
*                       Stack   = 12 Words
*
********************************************************************************
*
*  BENCHMARKS: Assumptions - Program in RAM0
*                          - Reserved data in RAM0
*                          - Stack on Local/Global Bus RAM
*                          - Sine/Cosine tables in RAM0
*                          - Processing and data destination in RAM1.
*                          - Local/Global Bus RAM, 0 wait state.
*
*      FFT Size    Bit Reversing    Data Source    Cycles (C30)
*      --------    -------------    -----------    -------------
*      1024        OFF              RAM1           25120 approx.
*
*  Note: This number does not include the C callable overheads.
*        This benchmark is the number of cycles between labels STARTB and ENDB.
*
*  NOTE: If .ffttxt is located in external SRAM, enable cache for faster
*        performance.
*
********************************************************************************
FP              .set    AR3
                .global _ifft_rl        ; Entry execution point.
                .global STARTB,ENDB
FFT_SIZE:       .usect  ".fftdat",1     ; Reserve memory for arguments.
LOG_SIZE:       .usect  ".fftdat",1
SOURCE_ADDR:    .usect  ".fftdat",1
DEST_ADDR:      .usect  ".fftdat",1
SINE_TABLE:     .usect  ".fftdat",1
BIT_REVERSE:    .usect  ".fftdat",1
SEPARATION:     .usect  ".fftdat",1
;
; Initialize C Function.
;
                .sect   ".ffttxt"
_ifft_rl:       PUSH    FP


RPuvuuUd UU UD UU oe 
Cre 
NNNNNNNNNNWH 
I 
| 
ve) 

I 


DP FFT_SIZE 
-if .REGPARM == 0 


;Preserve C environment. 


; Initialize DP pointer. 
;arguments passed in stack 


[X(I1) 


X (11) 


AR1-> 


AR2-> 


AR3-> 


-X (12) ] *COS-— 


AR4-> 


—-X (12) ] *SIN+ 


AR1-> 


*—-FP (2), ae 
*=EP (3) ; 

*-FP (4), 

* pee ae 
*-FP(6),RS 
*-FP(7),RE 

AR2, @FFT_SIZE 
R2,@LOG_SIZE 
R3, @SOURCE_ADDR 
RC, @DEST_ADDR 
RS, @SINE_TABLE 
RE, @BIT_REVERSE 


Perform Last FFT loops first 


(loop 2 onwards). 


LOOP 
lst 2nd 
\/ \/ 
XP (IL) 0 O <= KK. CEL) - tox (23) 
X(I1)_(1st) d 1 <= X(I1) + X(T2) 
(I1)_ (2nd) 2 2 F 
(I1)_ (3rd) 3 3 
X’ (12) 8 16 <- X’(I2) * 2 
X(1I2)_ (3rd) 13 29 
X(12)_ (2nd) 14 30 F 
-—s (1st) 15 31 <= X(T4) = X(T3) 
(133) 16 32 <= xX’ (I1) - &’ (13) 
ae (1st) 1). 33.<- 
[X(I3)+X(1I4)]*SIN 
X(13)_ (2nd) 18 34 
X(13)_ (3rd) 19. 35 
X’ (14) 24 48 <- -xX’ (14) * 2 
X(14)_ (3rd) 29 61 
X(14)_ (2nd) 30 62 
X(14)_ (1st) 31. 63 <= 
[X(I3)+X(I4)]*Ccos 
32 64 
33 65 
\I/ 
STARTB:


LOOP: 


INLOP: 


ag a aa 


PHNONHFPHFHNNENYF ENP NPHHYPHNEUN)P PH PH PHP PHPeENn 


nrn i 


Be 


SAPoOekPnwo 


OrmirrmHwu 
HH 


H 


OUP, 
HH 


ri 


3 


SH 


mj 


TE 


@FFT_SIZE,R7 


@FFT_SIZE, R6 


DDR, ARS 
DDR, AR1 


EA 
EA 


*—-AR1(IR1),*--AR3(IR1),RO 


*AR3,*AR1,RI1 
*--AR4,R2 
RO, *AR1++ 
-2.0,R2 
*— -AR2,R3 
R1, *AR3++ 
2,.0,R3 
R3, *AR2++(IR1) 
R2, *AR4++(IR1) 
@FFT_SIZE,IR1 
@SINE_TABLE, ARO 
-2,IR1 

3,RC 

*AR2,*AR1,R3 
*AR1, *AR2,R2 

R3, *++ARO(IRO),R1 
*AR4,R4 

R3, *++AR0 (IR1),RO 
*AR3,R4,R3 

R4, *AR3, R2 
R2,*AR1++ 
R2,*ARO--(IR1),R4 
R3, *AR2-- 

IN_BLK 
R4,R1,R3 
R2,*ARO,R1 

R3, *AR4-- 
R1,RO,R4 
*AR2,*AR1,R3 
*AR1, *AR2,R2 

R3, *++ARO(IRO),RI1 
R4, *AR3++ 

*AR4,R4 

R3, *++ARO (IR1),RO 


;step between two consecutive sines 
;stage number from 4 to M. 


;R7 is FFT_SIZE/4-1 (ie 15 for 64 
;pts) 

;and will be used to point at A & D. 
;R6 will be used to point at D. 


;R6 is FFT_SIZE at the lst loop 


;AR1 points at A. 
;AR2 points at B. 
;AR4 points at D. 


;AR3 points at C. 


IR1=SEPARATION BETWEEN SIN/COS TBLS 
ARO points at SIN/COS table 


; RO = X" (11) + X’ (13) ---+ 
; RL = X’ (11) - x’ (13) -+ 

; | 

pox (al) < |-+ 
; R2 = -2*x" (14) --+ | 

; | 

p Mi ray = + 

; R3 = 2*xX’ (12) -, 

> X’ (12) <------- ’ 

poh (dy eosaeoo ey : 


R3 = X(1I1)-xX(12) 


R2 = X(11)+X(1I2) ---+ 
Rl = R3*SIN 
R4 = X(T4) 


= X(14)-X(I3)  -- 


| 
| 
RO = R3*COS | 
| 
R2 = X(1I3)+X(TI4) | 


Ne Ne Ne Ne Ne Ne ee eS 
vs) 
Ww 
| 


X(11) <=s=s-==-==s=== + 
R4 = R2*COS 
X(I2) < + 
; R3 = R3*SIN + R2*COS -—----4 + 
; Rl = R2*SIN 
; K(T4) < + 
; R4 = R3*COS - R2*SIN 
; R3 = X(1I1)-X(T2) 
; R2 = X(I1)+X(1I2) ---+ 
; Rl = R3*SIN | 
;, X(I3) 
r 
t 


| 
R4 = X(14) | 
RO = R3*COS | 


IN_BLK: 


UBF3 
DDF3 
TE 
MES 


F 


ry 


DF3 
MES 


iuvuO 


PHNNNEZPHNHENPN 
| 


ri 


WOPHFHEZWAHHZAN 
U 
is 


LDA 
LSH 
LSH 


*AR3,R4,R3 

R4, *AR3,R2 
R2,*AR1++ 

R2, *ARO-—(IR1),R4 
R3, *AR2-- 
R4,R1,R3 
R2,*ARO,R1 

R3, *AR4-—- 
R1,RO,R4 
*AR2,*AR1,R3 
*AR1L, *AR2,R2 

R3, *++ARO (IRO),R1 
R4, *AR3++ 

*AR4,RA 

R3,*++ARO (IR1),RO 
*AR3,R4,R3 

R4, *AR3,R2 
R2,*ARL 

R2, *ARO-—(IR1),R4 
R3, *AR2 

R6,IR1 
R4,R1,R3 
R2,*ARO,R1 
R3, *AR4++(IR1) 
R1,RO,R4 
*AR1++(IR1),R2 
R4, *AR3++(IR1) 
AR5,AR1,R0 
@FFT_SIZE, RO 
INLOP 
*AR2++ (IRL 
R7,IR1 
R7,RC 
1, RS 
@LOG_SIZE,R5 
LOOP 
@SOURCE_ADDR, AR1 
1, IR0 

-1,R7 


Perform Third FFT loop 


R3 = X(1I14)-X(I3 == (S=+ 

R2 = X(1I3)+X(14 | 

X(I1) <------------- + 

R4 = R2*COS 

M{I2) < + 

R3 = R3*SIN + R2*COS -—----4 + 
Rl = R2*SIN 

X(I4) < + 
R4 = R3*COS —- R2*SIN 

R3 = X(I1)-X(I2) 

R2 = X(I1)+X(I2) ---+ 

Rl = R3*SIN 

X (13) 

R4 = X(I4) 


R3 = X(14)-X(1I3) -- 


| 
| 
| 
RO = R3*COS | 
| 
R2 = X(13)+xX(1I4) | 


X(I1) <------------- + 
R4 = R2*COS 
X(I2) < He 


Get prepared for the next 
R3 = R3*SIN + R2*COS -—----+1 t 
Rl = R2*SIN 


M{T4) < + 
R4 = R3*COS - R2*SIN 

DUMMY 

X (13) | 


LOOP BACK TO THE INNER LOOP 
DUMMY 


next stage if any left 


double step in sine table 
;
; Part A:
;
;        AR1-> | I1 |  0 <- X(I1) + X(I3)
;                      1
;        AR2-> | I2 |  2 <- 2 * X(I2)
;                      3
;        AR3-> | I3 |  4 <- X(I1) - X(I3)
;                      5
;        AR4-> | I4 |  6 <- -2 * X(I4)
;                      7
;        AR1->         8
;                      9
;                      .
;                     \|/
        LDA     @SOURCE_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        LDA     AR1,AR4
        ADDI    2,AR2
        ADDI    4,AR3
        LDI     @FFT_SIZE,RC
        LSH     -3,RC
        SUBI    1,RC
        RPTBD   LOOP3_A
        ADDI    6,AR4
        LDA     8,IR0
        LDA     @SINE_TABLE,AR0         ; AR0 points at SIN/COS table
        LDF     *AR3,R3
        ADDF3   R3,*AR1,R0              ; R0 = X'(I1) + X'(I3)
        SUBF3   R3,*AR1,R1              ; R1 = X'(I1) - X'(I3)
        LDF     *AR4,R2
||      STF     R0,*AR1++(IR0)          ; X'(I1) <- R0
        MPYF    -2.0,R2                 ; R2 = -2*X'(I4)
        LDF     *AR2,R3
||      STF     R1,*AR3++(IR0)          ; X'(I3) <- R1
        MPYF    2.0,R3                  ; R3 = 2*X'(I2)
LOOP3_A: STF    R3,*AR2++(IR0)          ; X'(I2) <- R3
||      STF     R2,*AR4++(IR0)          ; X'(I4) <- R2
;
; Part B:
;
;                      0
;        AR1-> | I1 |  1 <- X(I1) + X(I2)
;                      2
;        AR2-> | I2 |  3 <- X(I4) - X(I3)
;                      4
;        AR3-> | I3 |  5 <- [X(I1)-X(I2)]*COS - [X(I3)+X(I4)]*SIN
;                      6
;        AR4-> | I4 |  7 <- [X(I1)-X(I2)]*SIN + [X(I3)+X(I4)]*COS
;                      8
;        AR1->         9    NOTE: COS(2*pi/8) = SIN(2*pi/8)


SH 


NNPNUNPHHPNAHMPHPE PRPPPHHEE 
| Gq 

ius) 

H 


OREO pw 
K 
pal 
Ww 


\I/ 
@SOURCE_ADDR, AR1 
AR1,AR2 
AR1,AR3 
AR1,AR4 
1,AR1 
3,AR2 
5,AR3 
7,AR4 
@SINE_TABLE, ART 
@FFT_SIZE,RC 
= 3, RC 
RC, IR1 
2,RC 
*BR2,R6 
*AR3,RO 
R6,*AR1,R5 
R6,*AR1,R4 
RO,R4,R3 
RO,R4,R2 
RO, *AR4,R1 
R5, *AR1++ (IRO) 


LOOP3_B 
R2,*AR4,R5 

R1, *AR2++(IRO) 
R5,*++AR7(IR1),R1 
*AR4,R3,R2 
R2,*AR7, RO 

R1, *AR4++(IRO) 


*AR2,R6 
RO, *AR3++ (IRO) 
R6,*AR1,R5 
*AR3,RO 
R6,*AR1,R4 
RO,R4,R3 
RO,R4,R2 
RO, *AR4,R1 
R5,*AR1++(IRO) 
R2,*AR4,R5 

R1, *AR2++(IRO) 
R5,*AR7,R1 
*AR4,R3,R2 
R2,*AR7,RO 

R1, *AR4++(IRO) 
RO, *AR3 


AR7 points at SIN/COS table 


R6 = X(I2) 
RO = X(I3) 
R5 = X(I1)+X(I2) --------- + 
R4 = X(I1)-X (12) | 
R3 = X(I1)-X (12) -X (13) | 
R2 = X(I1)-X (12) +X (13) | 
Rl = X(I4)-X(I3) |-4 
X(I1) < + 
R5 = X(I1)-X (12) +X (13) +X (14) 
X(I2) < 4 
Rl = R5*SIN 
R2 = X(I1)-X(1I2)-X (13) -X (14) 
RO = R2*SIN ---+ 
X(I4) < | 

| 
R6 = X(I2) | 
X(I3) <----------- + 
R5 = X(I1)+X(I2) --------- + 
RO = X(I3) | 
R4 = X(I1)-X(1I2) | 
R3 = X(I1)-X(I2)-X(I3) | 
R2 = X(I1)-X (12) +X (13) | 
Rl = X(I4)-X(I3) |-4 
X(I1) < + 
R5 = X(I1)-X (12) +X (13) +X (14) 
X(I2) < 4 
Rl = R5*SIN < 
R2 = X(I1)-X(1I2)-X (13) -X (14) 
RO = R2*SIN 


X(I4) < 
;
; Perform first and second FFT loops.
;
;        AR1 -> | I1 |  0 <- X(I1) + X(I3) + 2*X(I2)
;        AR2 -> | I2 |  1 <- X(I1) + X(I3) - 2*X(I2)
;        AR3 -> | I3 |  2 <- X(I1) - X(I3) - 2*X(I4)
;        AR4 -> | I4 |  3 <- X(I1) - X(I3) + 2*X(I4)
;        AR1 ->         4
;                       .
;                      \|/
        LDA     @SOURCE_ADDR,AR1
        LDA     AR1,AR2
        LDA     AR1,AR3
        LDA     AR1,AR4
        ADDI    1,AR2
        ADDI    2,AR3
        ADDI    3,AR4
        LDA     4,IR0
        LDI     @FFT_SIZE,RC
        LSH     -2,RC
        SUBI    2,RC
        LDF     *AR4,R6                 ; R6 = X(I4)
        LDF     *AR2,R7                 ; R7 = X(I2)
        LDF     *AR1,R1                 ; R1 = X(I1)
        MPYF    2.0,R6                  ; R6 = 2 * X(I4)
        MPYF    2.0,R7                  ; R7 = 2 * X(I2)
        SUBF3   R6,*AR3,R5              ; R5 = X(I3) - 2*X(I4)
        SUBF3   R5,R1,R4                ; R4 = X(I1)-X(I3)+2X(I4)
        SUBF3   R7,*AR3,R5              ; R5 = X(I3) - 2*X(I2)
        STF     R4,*AR4++(IR0)          ; X(I4) <- R4
        ADDF3   R5,R1,R3                ; R3 = X(I1)+X(I3)-2X(I2)
        ADDF3   R6,*AR3,R4              ; R4 = X(I3) + 2*X(I4)
        STF     R3,*AR2++(IR0)          ; X(I2) <- R3
;
        RPTBD   LOOP1_2
        SUBF3   R4,R1,R4                ; R4 = X(I1)-X(I3)-2X(I4)
        ADDF3   R7,*AR3,R0              ; R0 = X(I3) + 2*X(I2)
||      STF     R4,*AR3++(IR0)          ; X(I3) <- R4
        ADDF3   R0,R1,R0                ; R0 = X(I1)+X(I3)+2X(I2)
;
        LDF     *AR4,R6                 ; R6 = X(I4)
        STF     R0,*AR1++(IR0)          ; X(I1) <- R0
        MPYF    2.0,R6                  ; R6 = 2 * X(I4)
        LDF     *AR2,R7                 ; R7 = X(I2)
        LDF     *AR1,R1                 ; R1 = X(I1)
        MPYF    2.0,R7                  ; R7 = 2 * X(I2)
        SUBF3   R6,*AR3,R5              ; R5 = X(I3) - 2*X(I4)
        SUBF3   R5,R1,R4                ; R4 = X(I1)-X(I3)+2X(I4)
        SUBF3   R7,*AR3,R5              ; R5 = X(I3) - 2*X(I2)
        STF     R4,*AR4++(IR0)          ; X(I4) <- R4
        ADDF3   R5,R1,R3                ; R3 = X(I1)+X(I3)-2X(I2)
        ADDF3   R6,*AR3,R4              ; R4 = X(I3) + 2*X(I4)
||      STF     R3,*AR2++(IR0)          ; X(I2) <- R3
        SUBF3   R4,R1,R4                ; R4 = X(I1)-X(I3)-2X(I4)
        ADDF3   R7,*AR3,R0              ; R0 = X(I3) + 2*X(I2)
||      STF     R4,*AR3++(IR0)          ; X(I3) <- R4
LOOP1_2: ADDF3  R0,R1,R0                ; R0 = X(I1)+X(I3)+2X(I2)
;
        STF     R0,*AR1                 ; LAST X(I1) <- R0
;
; Check Bit Reversing Mode (on or off).
;
; BIT_REVERSING = 0, then OFF (no bit reversing)
; BIT_REVERSING <> 0, then ON
;
ENDB    LDI     @BIT_REVERSE,R0
        BZ      MOVE_DATA
;
; Check Bit Reversing Type.
;
; If SourceAddr = DestAddr, then In Place Bit Reversing
; If SourceAddr <> DestAddr, then Standard Bit Reversing
;
        LDI     @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     IN_PLACE

Example 6-18. Real Inverse Radix-2 FFT (Continued)

;
; Bit reversing Type (From Source to Destination).
;
; NOTE: abs(SOURCE_ADDR - DEST_ADDR) must be > FFT_SIZE, this is not checked.
;
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @FFT_SIZE,IR0
        LSH     -1,IR0                  ; IR0 = Half FFT size.
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++(IR0)B
        STF     R1,*AR1++(IR0)B
        BR      DIVISION
;
; In Place Bit Reversing.
;
; Bit Reversing On Even Locations, 1st Half Only.
IN_PLACE: LDA   @FFT_SIZE,IR0


BITRV1: 


;Perform Bi 


BITRV2: 


C 
ens) 
ry Fa] AY 
9 


FOQunrart 
“3 


DFGT 
DFGT 


NQME 
3 
Fy] 


ct 


DI 


wn 
ian) 


DA 


SH 


OP 
OP 


DFGT 
DFGT 


3 


FPOQOURPNAPRPRPOWF HP AZSZNARPRP PPP eE 


DFGT 


LDFGT 


—2, IRO 
2,IR1 
@FFT_SIZE,RC 
-2,RC 
3, RC 
@DEST_ADDR, ARO 
ARO, AR1 

ARO, AR2 
*AR1++(IRO)B 
*AR2++(IRO)B 
*++AR0 (IRL) ,RO 
*AR1,R1 


AR1,ARO 


*AR1++ (IRO)B,R1 
*++ARO (IRL) ,RO 

RO, *ARO 

*AR1,R1 

R1, *AR2++(IRO)B 
AR1, ARO 

RO,RL 
*AR1++(IRO)B, RO 
RO, *ARO 

R1, *AR2 


Reversing Odd Locations, 


@FFT_SIZE,RC 
-1,RC 
@DEST_ADDR, ARO 
RC, ARO 

1, ARO 

ARO, AR1 

ARO, AR2 

-1,RC 

3, RC 
*AR1++(IRO)B 
*AR2++(IRO)B 
*++AR0 (IRL) ,RO 
*AR1,R1 

BITRV2 


*AR1++(IRO)B,R1 
*++AR0 (IRL) ,RO 
RO, *ARO 

*AR1,R1 

R1, *AR2++(IRO)B 
AR1, ARO 


*AR1++ (IRO)B, RO 
RO, *ARO 


; IRO = Quarter FFT size. 


; Xchange Locations only if ARO<ARI1. 


2nd Half Only 


; Xchange Locations only if ARO<ARI1. 


; STF R1,*AR2 later 
; Perform Bit Reversing On Odd Locations, 1st Half Only.
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        LDA     RC,IR0
        LDA     @DEST_ADDR,AR0
        LDA     AR0,AR1
        ADDI    1,AR0
        ADDI    IR0,AR1
        LSH     -1,RC
        LDA     RC,IR0
        SUBI    2,RC
        RPTBD   BITRV3
        STF     R1,*AR2
        LDF     *AR0,R0
        LDF     *AR1,R1
        LDF     *++AR0(IR1),R0
||      STF     R0,*AR1++(IR0)B
BITRV3: LDF     *AR1,R1
||      STF     R1,*-AR0(IR1)
        STF     R0,*AR1
        STF     R1,*AR0
        BR      DIVISION
;
; Check Data Source Locations.
;
; If SourceAddr = DestAddr, then do nothing.
; If SourceAddr <> DestAddr, then move data.
;
MOVE_DATA: LDI  @SOURCE_ADDR,R0
        CMPI    @DEST_ADDR,R0
        BEQ     DIVISION
        LDI     @FFT_SIZE,R0
        SUBI    2,R0
        LDA     @SOURCE_ADDR,AR0
        LDA     @DEST_ADDR,AR1
        LDF     *AR0++,R1
        RPTS    R0
        LDF     *AR0++,R1
||      STF     R1,*AR1++
        STF     R1,*AR1
DIVISION: LDA   2,IR0
        LDI     @FFT_SIZE,R0
        FLOAT   R0                      ; exp = LOG_SIZE
        PUSHF   R0                      ; 32 MSBs saved
        POP     R0
        NEGI    R0                      ; Neg exponent
        PUSH    R0
        POPF    R0                      ; R0 = 1/FFT_SIZE
        LDA     @DEST_ADDR,AR1
        LDI     @FFT_SIZE,RC
        LSH     -1,RC
        SUBI    2,RC
        RPTBD   LAST_LOOP
        LDA     @DEST_ADDR,AR2
        NOP     *AR2++
        MPYF3   R0,*AR1,R1              ; 1st location
        MPYF3   R0,*AR2,R2              ; 2nd, 4th, 6th,... location
||      STF     R1,*AR1++(IR0)
LAST_LOOP: MPYF3 R0,*AR1,R1             ; 3rd, 5th, 7th,... location
||      STF     R2,*AR2++(IR0)
        MPYF3   R0,*AR2,R2              ; last location
||      STF     R1,*AR1
        STF     R2,*AR2
;
; Return to C environment.
;
        POP     DP                      ; Restore C environment variables.
        POP     AR7
        POP     AR6
        POP     AR5
        POP     AR4
        POPF    R7
        POP     R7
        POPF    R6
        POP     R6
        POP     R5
        POP     R4
        POP     FP
        RETS
        .end
*
* No more.
*
********************************************************************************
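
The source-to-destination bit-reversing step of Example 6-18 reads the source sequentially and writes the destination through bit-reversed (reverse-carry) addressing with a step of half the FFT size. The following host-side C sketch imitates that addressing. The names rc_add and bit_reverse_copy are hypothetical and are not part of the TI sources, and the sketch assumes the size is a power of two and that the two buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse-carry add of 'half' (= n/2) to idx: the carry propagates from the
 * most significant index bit toward the least significant one, which is the
 * effect of the *ARn++(IR0)B addressing mode with IR0 = n/2. */
static unsigned rc_add(unsigned idx, unsigned half)
{
    unsigned bit = half;
    while (bit != 0u && (idx & bit) != 0u) {
        idx ^= bit;     /* clear this bit and carry into the next lower bit */
        bit >>= 1;
    }
    return idx | bit;   /* set the first clear bit found (none on wrap-around) */
}

/* Copy src[0..n-1] to dst so that element i lands at the bit-reversed index
 * of i. n must be a power of two and src/dst must not overlap. */
void bit_reverse_copy(const float *src, float *dst, unsigned n)
{
    unsigned i, j = 0u;
    assert(src != NULL && dst != NULL);
    assert(n >= 2u && (n & (n - 1u)) == 0u);
    for (i = 0u; i < n; ++i) {
        dst[j] = src[i];        /* sequential read, bit-reversed write */
        j = rc_add(j, n >> 1);
    }
}
```

For n = 8 the write index visits 0, 4, 2, 6, 1, 5, 3, 7, so src[1] is stored in dst[4] and src[3] in dst[6].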
6.6 'C4x Benchmarks

Table 6-1 provides benchmarks for common DSP operations. Table 6-2
summarizes the FFT execution time required for FFT lengths between 64 and
1024 points for the algorithms in Example 6-12, Example 6-14, Example 6-15,
Example 6-17, and Example 6-18.

The benchmarks are given in cycles (the H1 internal processor cycle). To get
the benchmark time, multiply the number of cycles by the processor's internal
clock period. For example, for a 50 MHz 'C4x, multiply by 40 ns.
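
As a worked instance of the conversion described above, the following C sketch multiplies a cycle count by the H1 period. The function name cycles_to_ns is hypothetical; the 25120-cycle figure is the 1024-point real inverse FFT entry from the benchmark data, and 40 ns is the period quoted for a 50 MHz device.

```c
#include <assert.h>

/* Convert a benchmark given in H1 cycles to nanoseconds.
 * h1_period_ns is the internal clock period (40 for a 50 MHz 'C4x). */
unsigned long cycles_to_ns(unsigned long cycles, unsigned long h1_period_ns)
{
    assert(h1_period_ns > 0UL);
    return cycles * h1_period_ns;
}
```

With these inputs, cycles_to_ns(25120UL, 40UL) gives 1004800 ns, that is, about 1.005 ms for the 1024-point real inverse FFT.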


Table 6-1. 'C4x Application Benchmarks

Application                                      Words      Cycles
---------------------------------------------    -------    -------------
Inverse of a float (32-bit mantissa accuracy)
Double-precision integer multiply                2
Square root (32-bit mantissa accuracy)           11
Vector dot product†                              6
Matrix Times a Vector                            10
FIR Filter                                       6
IIR Filter (One Biquad)                          7
IIR Filter (N>1 Biquads)                         15         2 + 6N
LMS Lattice Filter                               11         1 + 5P
Inverse LPC Lattice Filter                       9          3 + 3P
Mu-law (A-law) Compression                       15 (16)    14 (16/10)
Mu-law (A-law) Expansion                         11 (15)    11/10 (15/13)

† Based on a modification of the matrix times a vector benchmark


Table 6-2. FFT Timing Benchmarks (Cycles)

          Complex                                       Real
          Radix-2        Radix-4        Radix-2         Forward        Inverse
Points    Example 6-12   Example 6-14   Example 6-15    Example 6-17   Example 6-18
------    ------------   ------------   ------------    ------------   ------------
64        2290†          1745†          1425†           75a†           1012†
128       5179†          -              3336†           1683           2269†
256       11588†         9216†          7655†           3814†          5086†
512       25677†         -              17302†          8633†          11343†
1024      56411‡         47237‡         38945           19404†         25120†

Assumptions:

† The data is in on-chip RAM1. Program (.ffttxt) and reserved data (.fftdat)
  are in on-chip RAM0. The sine/cosine table is in on-chip RAM0. Bit
  reversing is not considered. The cache is enabled.

‡ The data is in on-chip RAM. Program (.ffttxt) and reserved data (.fftdat)
  are in local (global) bus RAM with 0 wait states. Bit reversing is not
  considered. The sine/cosine table is on the global (local) bus. The cache
  is enabled.
Programming the DMA Coprocessor


The 'C4x DMA (Direct Memory Access) coprocessor is a 'C4x peripheral module.
With its six channels, the DMA maximizes sustained CPU performance by
relieving the CPU of burdensome I/O. Any of the six DMA channels can transfer
data to and from anywhere in the 'C4x's memory map for maximum flexibility.


Topic                                                        Page
7.1  Hints for DMA Programming                               7-2
7.2  When a DMA Channel Finishes a Transfer                  7-3
7.3  DMA Assembly Programming Examples                       7-4
7.4  DMA C-Programming Examples                              7-9
7.1 Hints for DMA Programming


The following hints will help you improve your DMA programming and avoid
unexpected results:

❏ Reset the DMA register before starting the channel. This clears any
  previously latched interrupt that may no longer exist. Also, set the DIE
  register (enabling interrupts for sync transfer) after starting the DMA
  channel.

❏ Take care in selecting the priority used to arbitrate between the CPU and
  DMA and also between DMA channels. If a DMA channel fails to finish a
  block transfer, it may have lower priority in a conflicting environment
  and not be granted access to the resource. CPU/DMA rotating priority is
  considered a safe first choice. Depending on CPU/DMA execution load,
  selection of other priority schemes could result in faster code. Fine
  tuning may be needed.

❏ Ensure that each interrupt is received when you use interrupt
  synchronization; otherwise, the DMA will never complete the block
  transfer.

❏ For faster execution, avoid memory/resource access conflicts between the
  CPU and DMA. Carefully allocate the different sections of the program in
  memory. Use the same care with DMA autoinitialization values in memory.

❏ Try to use read/write synchronization when reading from or writing to
  communication ports. This avoids a peripheral-bus halt during a read from
  an empty input FIFO or a write to a full output FIFO.

❏ Choose between DMA read and write synchronization when using a DMA
  channel to transfer from one communication port to another. The 'C4x does
  not allow synchronization of DMA channel reads/writes with ICRDYi/OCRDYj
  signals coming from two different communication ports (i ≠ j).

❏ When your application requires initializing the primary (or auxiliary)
  DMA channel while the auxiliary (or primary) channel may still be running,
  halt the running channel by writing a halt signal to the START or AUX
  START bits. Before proceeding, check the STATUS or AUX STATUS bits of the
  running channel to ensure it has halted. This is necessary because the
  DMA halt takes place on read/write boundaries (depending on the type of
  halt issued), and the channel must wait for any ongoing read or write
  cycles to complete. When reinitializing this channel, be especially
  careful to restore its previous status exactly. For an example of how to
  deal with this situation, refer to the Designer Notebook Page, split-mode
  DMA re-initialization, available through the DSP hotline.


When a DMA Channel Finishes a Transfer 


7.2 When a DMA Channel Finishes a Transfer 


Many applications require that you perform certain tasks after a DMA channel 
has finished a block transfer. 


You can program the DMA to interrupt the CPU when this happens (TCC or 
AUX TCC bits). You can also achieve this by polling if: 


Ly 


The corresponding IIF (DMA INTx) bit is set to 1 (interrupt polling). 
This requires that the DMA control register TCC (or AUX TCC) bit be set 
first. This method does not cause any extra CPU/DMA access conflict. But 
its drawback, when using split mode, is that you cannot differentiate 
whether the primary or auxiliary channel has finished. 


The transfer counter has a zero value. This option is sometimes not reli- 
able, because the DMA channel could be in the middle of an autoinitializa- 
tion sequence. 


The TCINT (or AUX TCINT) flag is set to 1. This option is reliable, but the
CPU must poll the flag over the peripheral bus, potentially causing CPU/DMA
access conflict if the DMA is operating to/from the peripheral bus. This is a
good option if you do not foresee any problem with the additional access
delay.


The START (AUX START) bits in the DMA channel control register are
set to 10 (binary). This option can also cause a CPU/DMA access conflict.
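

The polling test on the START (AUX START) field can be sketched in C. This is a minimal sketch, not TI code: the function and macro names are hypothetical, and the masks simply repeat the comparison made by the wait macros in dma.h (Example 7-12), which look for the value 10 (binary) in bits 22-23 (START) or bits 24-25 (AUX START). The control word is passed in as a plain value so that the logic can be checked on a host.

```c
#include <stdint.h>

/* Field masks taken from the wait macros in dma.h (Example 7-12). */
#define DMA_START_MASK      0x00C00000u  /* START, bits 22-23 */
#define DMA_START_DONE      0x00800000u  /* START = 10 (binary) */
#define DMA_AUX_START_MASK  0x03000000u  /* AUX START, bits 24-25 */
#define DMA_AUX_START_DONE  0x02000000u  /* AUX START = 10 (binary) */

/* Hypothetical helper: nonzero when the primary (or unified) channel is done. */
int dma_primary_done(uint32_t ctrl)
{
    return (ctrl & DMA_START_MASK) == DMA_START_DONE;
}

/* Hypothetical helper: nonzero when the auxiliary channel is done. */
int dma_aux_done(uint32_t ctrl)
{
    return (ctrl & DMA_AUX_START_MASK) == DMA_AUX_START_DONE;
}
```

On the device, the argument would be the value read from the channel's control register, so each call costs one peripheral-bus read; that read is the source of the possible CPU/DMA access conflict.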
7.3 DMA Assembly Programming Examples


The DMA coprocessor is a memory-mapped peripheral that you can easily
program from C as well as from assembly. Example 7-1 through Example 7-5
show how to program the DMA coprocessor in assembly language.
Example 7-6 through Example 7-11 show how to program it from C. The
source code for Example 7-6 through Example 7-11 can be found on the
TI BBS (self-extracting file: C4xdmaex.exe).


Example 7-1 shows one way to set up DMA channel 2 to initialize an array
to zero. This DMA transfer is set up to have priority over CPU operation and
to generate an interrupt flag, DMA INT2, after the transfer is completed. The
DMA control register is set to 00C4 0007h.


Example 7-1. Array Initialization With DMA


*
*  TITLE ARRAY INITIALIZATION WITH DMA
*
*  THIS EXAMPLE INITIALIZES A 128-ELEMENT ARRAY TO ZERO. THE DMA
*  TRANSFER IS SET UP TO HAVE HIGHER PRIORITY THAN CPU OPERATION.
*  THE DMA INT2 INTERRUPT FLAG IS SET TO 1 AFTER THE TRANSFER IS
*  COMPLETED.
*
          .data
DMA2      .word   001000C0H       ;DMA channel 2 map address
CONTROL   .word   00C40007H       ;DMA register initialization data
SOURCE    .word   ZERO
SRC_IDX   .word   0
COUNT     .word   128
DESTIN    .word   ARRAY
DES_IDX   .word   1
ZERO      .float  0.0             ;Array initialization value 0.0
          .bss    ARRAY,128
          .text
START     LDP     @DMA2           ;Load data page pointer
          LDA     @DMA2,AR0       ;Point to DMA channel 2 registers
          LDI     @SOURCE,R0      ;Initialize DMA source register
          STI     R0,*+AR0(1)
          LDI     @SRC_IDX,R0     ;Initialize DMA source index register
          STI     R0,*+AR0(2)
          LDI     @COUNT,R0       ;Initialize DMA count register
          STI     R0,*+AR0(3)
          LDI     @DESTIN,R0      ;Initialize DMA destination register
          STI     R0,*+AR0(4)
          LDI     @DES_IDX,R0     ;Initialize DMA destination index register
          STI     R0,*+AR0(5)
          LDI     @CONTROL,R0     ;Start DMA channel 2 transfer
          STI     R0,*AR0
          .end


The DMA transfer can be synchronized with external interrupts,
communication-port ICRDY/OCRDY signals, and timer interrupts. To enable this
feature, the SYNCH MODE field, bits 6-7, of the DMA control register must be
configured to a proper value, and the corresponding bits of the DMA-interrupt
enable (DIE) register must be set. Example 7-2 sets up DMA channel 4 read
synchronization with the communication-port 4 ICRDY signal. The DMA
continuously transfers data from the communication-port input register until the
START field, bits 22-23 of the DMA control register, is changed by the CPU.
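

The control words used in these examples can be cross-checked by assembling them from their fields. The sketch below is illustrative only. The SYNCH MODE (bits 6-7) and START (bits 22-23) positions come from the text above; the DMA PRI (bits 0-1), TRANSFER MODE (bits 2-3), and TCC (bit 18) positions are assumptions that should be verified against the TMS320C4x User's Guide. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: pack DMA control-register fields into one word. */
uint32_t dma_ctrl_word(unsigned pri, unsigned xfer_mode, unsigned sync_mode,
                       unsigned tcc, unsigned start)
{
    return ((uint32_t)(pri       & 0x3u) << 0)   /* DMA PRI, assumed bits 0-1       */
         | ((uint32_t)(xfer_mode & 0x3u) << 2)   /* TRANSFER MODE, assumed bits 2-3 */
         | ((uint32_t)(sync_mode & 0x3u) << 6)   /* SYNCH MODE, bits 6-7            */
         | ((uint32_t)(tcc       & 0x1u) << 18)  /* TCC, assumed bit 18             */
         | ((uint32_t)(start     & 0x3u) << 22); /* START, bits 22-23               */
}
```

Under these assumptions, dma_ctrl_word(0, 0, 1, 0, 3) gives 00C00040h, the control word of Example 7-2, and dma_ctrl_word(3, 1, 0, 1, 3) gives 00C40007h, the control word of Example 7-1.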


Example 7-2. DMA Transfer With Communication-Port ICRDY Synchronization


*
*  TITLE DMA TRANSFER WITH COMMUNICATION PORT ICRDY
*        SYNCHRONIZATION
*
*  THIS EXAMPLE SETS UP DMA CHANNEL 4 TO TRANSFER DATA FROM
*  COMMUNICATION PORT INPUT REGISTER TO INTERNAL RAM WITH ICRDY
*  SIGNAL READ SYNCHRONIZATION. THE TRANSFER MODE OF THE DMA IS
*  SET TO 00; THEREFORE, THE TRANSFER WON'T STOP UNTIL THE START
*  BITS OF THE DMA CONTROL REGISTER ARE CHANGED.
*
          .data
DMA4      .word   001000E0H       ;DMA channel 4 map address
CONTROL   .word   00C00040H       ;DMA register initialization data
SOURCE    .word   00100081H
SRC_IDX   .word   0
COUNT     .word   0               ;Transfer counter is set to largest value
DESTIN    .word   002FF800H
DES_IDX   .word   1
          .text
START     LDP     @DMA4           ;Load data page pointer
          LDA     @DMA4,AR0       ;Point to DMA channel 4 registers
          LDI     @SOURCE,R0      ;Initialize DMA source register
          STI     R0,*+AR0(1)
          LDI     @SRC_IDX,R0     ;Initialize DMA source index register
          STI     R0,*+AR0(2)
          LDI     @COUNT,R0       ;Initialize DMA count register
          STI     R0,*+AR0(3)
          LDI     @DESTIN,R0      ;Initialize DMA destination register
          STI     R0,*+AR0(4)
          LDI     @DES_IDX,R0     ;Initialize DMA destination index register
          STI     R0,*+AR0(5)
          LDI     @CONTROL,R0     ;Start DMA channel 4 transfer
          STI     R0,*AR0
          LDHI    010H,DIE        ;Enable ICRDY 4 read sync.
          .end


If external interrupt signals are used for DMA transfer synchronization, then
pins IIOF0-3 must be configured as interrupt pins.


Besides addressing the memory-mapped port registers directly, the ’C4x DMA
split mode is another way to transfer data to and from a communication port.
When the split-mode bit of the DMA control register is set, the DMA is
separated into primary and auxiliary channels. The primary channel transfers
data from memory to the communication-port output register, and the auxiliary
channel transfers data from the communication port to memory. The
communication-port number is selected in bits 15-17 of the DMA control register.
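

As a quick check on a split-mode control word, the communication-port field can be extracted with a shift and a mask. The sketch below is hedged: the port field position (bits 15-17) comes from the text, the split-mode bit is assumed to be bit 14 (verify this against the TMS320C4x User's Guide), and both function names are hypothetical.

```c
#include <stdint.h>

/* Communication-port number selected for split mode (bits 15-17). */
unsigned dma_split_port(uint32_t ctrl)
{
    return (unsigned)((ctrl >> 15) & 0x7u);
}

/* Nonzero when the split-mode bit is set (assumed here to be bit 14). */
int dma_is_split(uint32_t ctrl)
{
    return (int)((ctrl >> 14) & 0x1u);
}
```

Applied to the control word 03CDD0D4h listed in Example 7-3, dma_split_port returns 3, which agrees with the example's use of communication port 3.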


Example 7-3 shows how to set up DMA channel 1 in split mode. The DMA
primary channel transfers data from internal RAM to communication port 3
through external interrupt INT2 synchronization and bit-reversed addressing.
The DMA auxiliary channel transfers data from communication port 3 to
internal RAM via external interrupt INT3 synchronization and linear addressing.


Example 7-3. DMA Split-Mode Transfer With External-Interrupt Synchronization


*
*  TITLE DMA SPLIT-MODE TRANSFER WITH EXTERNAL INTERRUPT SYNCHRONIZATION
*
*  THIS EXAMPLE SETS UP DMA CHANNEL 1 TO SPLIT-MODE. THE PRIMARY CHANNEL TRANSFERS
*  DATA FROM INTERNAL RAM TO COMM PORT 3 OUTPUT REGISTER WITH EXTERNAL INTERRUPT
*  INT2 SYNCHRONIZATION AND BIT-REVERSED ADDRESSING. THE AUXILIARY CHANNEL TRANSFERS
*  DATA FROM COMMUNICATION PORT 3 INPUT REGISTER TO INTERNAL RAM WITH EXTERNAL
*  INTERRUPT INT3 SYNCHRONIZATION AND LINEAR ADDRESSING.
*
          .data
DMA1      .word   001000B0H       ;DMA channel 1 map address
CONTROL   .word   03CDD0D4H       ;DMA register initialization data
SOURCE    .word   002FFC00H
SRC_IDX   .word   08H             ;The same value as IR0 for bit-reversed
COUNT     .word   8
DESTIN    .word   002FF800H
DES_IDX   .word   1
AUX_CNT   .word   8
          .text
START     LDP     @DMA1           ;Load data page pointer
          LDA     @DMA1,AR0       ;Point to DMA channel 1 registers
          LDI     @SOURCE,R0      ;Initialize DMA primary source register
          STI     R0,*+AR0(1)
          LDI     @SRC_IDX,R0     ;Initialize DMA primary source index register
          STI     R0,*+AR0(2)
          LDI     @COUNT,R0       ;Initialize DMA primary count register
          STI     R0,*+AR0(3)
          LDI     @DESTIN,R0      ;Initialize DMA aux destination register
          STI     R0,*+AR0(4)
          LDI     @DES_IDX,R0     ;Initialize DMA aux destination index register
          STI     R0,*+AR0(5)
          LDI     @AUX_CNT,R0     ;Initialize DMA auxiliary count register
          STI     R0,*+AR0(7)
          LDI     @CONTROL,R0     ;Start DMA channel 1 transfer
          STI     R0,*AR0
          LDI     01100H,IIF      ;Configure INT2 and INT3 as interrupt pins
          LDI     0A0H,DIE        ;Enable INT2 read and INT3 write sync.
          .end
An advantage of the ’C4x DMA is the autoinitialization feature. This allows you
to set up the DMA transfer in advance and makes the DMA operation
completely independent from the CPU. When the DMA operates in
autoinitialization mode, the link pointer and auxiliary link pointer initialize the
registers that control the DMA operation. The link pointer can be incremented
(AUTOINIT STATIC = 0) or held constant (AUTOINIT STATIC = 1) during
autoinitialization. This option allows autoinitialization values to be
stored in sequential memory locations or in stream-oriented devices such as
the on-chip communication ports or external FIFOs. When DMA SYNC MODE
is enabled, the DMA autoinitialization operation can be configured to
synchronize with the same signal. Example 7-4 sets up DMA channel 0 to wait for
the communication port to input the initialization value. After DMA
autoinitialization is complete, the DMA channel starts transferring data from the
communication port input register to internal RAM.
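

To illustrate how a chain of autoinitialization blocks in memory is followed through the link pointer, the sketch below models the chain in plain C. It is a host-side model, not device code. The structure mirrors the seven-word register order used by the examples in this chapter (control, source, source index, counter, destination, destination index, link pointer). It assumes that TRANSFER MODE occupies bits 2-3 of the control word and that the value 10 (binary) means "autoinitialize when the transfer counter reaches zero". All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One autoinitialization block: the values loaded into the DMA registers. */
struct dma_block {
    uint32_t ctrl;                 /* control word */
    uint32_t src, src_idx;         /* source address and index */
    uint32_t counter;              /* transfer counter (words) */
    uint32_t dst, dst_idx;         /* destination address and index */
    const struct dma_block *link;  /* next block to load */
};

/* Assumed encoding: TRANSFER MODE (bits 2-3) == 10b keeps autoinit enabled. */
static int autoinit_enabled(uint32_t ctrl)
{
    return ((ctrl >> 2) & 0x3u) == 0x2u;
}

/* Total words moved by a chain: follow the links until a block disables
   autoinitialization. max_blocks guards against circular lists. */
uint32_t dma_chain_words(const struct dma_block *first, unsigned max_blocks)
{
    uint32_t words = 0;
    const struct dma_block *b = first;

    while (b != NULL && max_blocks-- > 0) {
        words += b->counter;
        if (!autoinit_enabled(b->ctrl))
            break;                 /* this block stops the channel */
        b = b->link;               /* otherwise load the next block */
    }
    return words;
}
```

A circular list, such as the one in Example 7-5 where the last block links back to the first, never terminates on its own in this model, which is why the walker takes an explicit block limit.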


Example 7-4. DMA Autoinitialization With Communication Port ICRDY


*
*  TITLE DMA AUTOINITIALIZATION WITH COMMUNICATION PORT ICRDY
*
*  THIS EXAMPLE SETS UP DMA CHANNEL 0 TO WAIT FOR COMMUNICATION
*  PORT TO INPUT THE INITIALIZATION VALUE. THE DMA AUTOINITIAL-
*  IZATION AND TRANSFER ARE BOTH DRIVEN BY ICRDY 0 FLAG. AFTER
*  DMA AUTOINIT IS COMPLETED, THE DMA CHANNEL STARTS TRANSFERRING
*  DATA FROM COMM PORT INPUT REGISTER TO INTERNAL RAM WITH ICRDY
*  0 READ SYNCHRONIZATION. THE VALUES IN COMM PORT 0 INPUT FIFO
*  SHOULD BE:
*
*  SEQUENCE | VALUE
*  ---------+-------------------------------------------------
*     1     | 00C40047H (STOP AFTER TRANSFER COMPLETED)
*           | OR 00C4054BH (REPEAT AFTER TRANSFER COMPLETED)
*     2     | 00100041H
*     3     | 0H
*     4     | 20H
*     5     | 002FF800H
*     6     | 1H
*     7     | 00100041H
*
          .data
DMA0      .word   001000A0H       ;DMA channel 0 map address
DMA_INIT  .word   0004054BH       ;DMA initialization control word
LINK      .word   00100041H       ;Comm port input register address
DMA_START .word   00C4054BH       ;DMA start control word
          .text
START     LDP     @DMA0           ;Load data page pointer
          LDA     @DMA0,AR0       ;Point to DMA channel 0 registers
          LDI     @DMA_INIT,R0    ;Initialize DMA control register
          STI     R0,*AR0
          LDI     @LINK,R0        ;Initialize DMA link pointer
          STI     R0,*+AR0(6)
          LDI     @DMA_START,R0   ;Start DMA channel 0 transfer
          STI     R0,*AR0
          LDI     01H,DIE         ;Enable ICRDY 0 read sync.
          .end


The DMA autoinitialization and transfer continues executing if the DMA autoin- 
itialization is still enabled. Therefore, a DMA setup like the one in Example 7—4 
can make it possible for an external device to control the DMA operation 
through the communication port. 


With the autoinitialization feature, the ’C4x DMA coprocessor can support a
variety of DMA operations without slowing down CPU computation. A good
example is a DMA transfer triggered by one interrupt signal. Usually, this is
implemented by starting a DMA activity with a CPU interrupt service routine, but
this utilizes CPU time. However, as shown in Example 7-5, you can set up a single
interrupt-driven dummy DMA transfer with autoinitialization. When the
interrupt signal is set, the DMA will complete the dummy DMA transfer and start
the autoinitialization for the desired DMA transfer.


Example 7-5. Single-Interrupt-Driven DMA Transfer


*
*  TITLE SINGLE INTERRUPT-DRIVEN DMA TRANSFER
*
*  THIS EXAMPLE SETS UP A DUMMY DMA TRANSFER FROM INTERNAL RAM
*  TO THE SAME MEMORY WITH EXTERNAL INT 0 SYNCHRONIZATION AND
*  AUTOINITIALIZATION FOR TRANSFERRING 64 DATA WORDS FROM LOCAL MEMORY
*  TO INTERNAL RAM. AFTER THE SECOND TRANSFER IS COMPLETED, THE
*  DMA IS RE-INITIALIZED TO FIRST DMA TRANSFER SETUP.
*
          .data
DMA5      .word   001000F0H       ;DMA channel 5 map address
DMA_INIT  .word   0000004BH       ;DMA initialization control word
LINK      .word   DMA1            ;1st DMA link list address
DMA_START .word   00C0004BH       ;DMA start control word
DMA1      .word   00C0004BH       ;1st dummy DMA transfer link list
          .word   002FF800H
          .word   00000000H
          .word   00000001H
          .word   002FF800H
          .word   00000000H
          .word   DMA2
DMA2      .word   00C4000BH       ;The desired DMA transfer link list
          .word   00400000H
          .word   00000001H
          .word   00000040H
          .word   002FF800H
          .word   00000001H
          .word   DMA1
          .text
START     LDP     @DMA5           ;Load data page pointer
          LDA     @DMA5,AR0       ;Point to DMA channel 5 registers
          LDI     @DMA_INIT,R0    ;Initialize DMA control register
          STI     R0,*AR0
          LDI     @LINK,R0        ;Initialize DMA link pointer
          STI     R0,*+AR0(6)
          LDI     @DMA_START,R0   ;Start DMA channel 5 transfer
          STI     R0,*AR0
          LDI     01H,IIF         ;Configure INT0 as interrupt pin
          LDHI    0800H,DIE       ;Enable INT 0 read sync. for
                                  ;DMA channel 5
          .end
7.4 DMA C-Programming Examples


Example 7-6 through Example 7-11 are DMA programming examples in C.
These examples cover unified and split mode, DMA autoinitialization, and
DMA synchronization operations. Descriptions of the examples presented are
as follows:


Example 7-6: Unified-mode DMA transfers data between commports using
read sync.


Example 7-7: Unified-mode DMA uses autoinitialization (method 1) to
transfer 2 data blocks.


Example 7-8: Unified-mode DMA uses autoinitialization (method 2) to
transfer 2 data blocks.


Example 7-9: Split-mode auxiliary DMA transfers data between commports
using read sync.


Example 7-10: Split-mode auxiliary and primary channel send/receive
data to and from commport.


Example 7-11: Split-mode DMA autoinitializes both auxiliary and primary
channels (auxiliary transfers 1 block and primary transfers 2 blocks).


Example 7-12 is the include file for all examples (dma.h).
Example 7-6. Unified-Mode DMA Using Read Sync


/*******************************************************************************
EXAMPLE: Unified-mode
         Commport-to-commport transfer:
         DMA3 in unified mode transfers 8 words from commport 3 to commport 0.
         DMA3 source sync with ICRDY3 is used.
         Note: Writes cannot be synchronized with OCRDY0, because a DMA i can
         only be synchronized with signals coming from commport i. You could sync
         on ICRDY3 or on OCRDY0, not both (the choice depends on the specific
         application to avoid deadlock).
         In this program, DMA3 expects data in commport 3 being sent by
         another processor/device. Otherwise no transfer will occur.
*******************************************************************************/

#include "dma.h"

#define DMAADDR  0x001000d0
#define CTRLREG  0x00c40045  /* DMA sends interrupt to CPU when transfer
                                finishes (TC=1), DMA-CPU rotating priority */
#define SRC      0x00100071  /* src = commport 3 input fifo */
#define SRC_IDX  0x0         /* src address does not increment */
#define COUNTER  0x08        /* number of words to transfer */
#define DST      0x00100042  /* dst = commport 0 output fifo */
#define DST_IDX  0x0         /* dst address does not increment */
#define DIEVAL   0x4000      /* set ICRDY3 read sync */

DMAUNIF *dma = (DMAUNIF *)DMAADDR;
int dieval = DIEVAL;

main() {
  dma->src     = (void *)SRC;
  dma->src_idx = SRC_IDX;
  dma->counter = COUNTER;
  dma->dst     = (void *)DST;
  dma->dst_idx = DST_IDX;
  dma->ctrl    = (void *)CTRLREG;

  asm(" ldi @_dieval,die");
  PRIM_WAIT_DMA((volatile int *)dma);
}
Example 7-7. Unified-Mode DMA Using Autoinitialization (Method 1)


/*******************************************************************************
EXAMPLE: Unified Mode
         Autoinitialization method 1:
         DMA0 in unified mode transfers 8 words from 0x02ffc00 (index 1) to
         0x02ffd00 (index 1) and then it transfers 4 words from 0x02ffe00 (index 4)
         to 0x02fff00 (index 1). No DMA sync transfer is used.
         Autoinitialization method 1 requires N autoinitialization memory blocks
         to transfer N blocks and starts with a DMA transfer counter equal to 0.
*******************************************************************************/

#include "dma.h"

#define DMAADDR   0x001000a0

/* 1st transfer settings */
#define CTRLREG1  0x00c00009  /* DMA-CPU rotating priority and DMA
                                 autoinitializes when transfer counter = 0 */
#define SRC1      0x002ffc00  /* src address */
#define SRC1_IDX  0x1         /* src address increment */
#define COUNTER1  0x08        /* number of words to transfer */
#define DST1      0x002ffd00  /* dst address */
#define DST1_IDX  0x1         /* dst address increment */

/* 2nd transfer settings */
#define CTRLREG2  0x00c40005  /* DMA sends interrupt to CPU when transfer
                                 finishes (TC=1), DMA-CPU rotating priority
                                 and DMA stops after transfer completes */
#define SRC2      0x002ffe00  /* src address */
#define SRC2_IDX  0x4         /* src address increment */
#define COUNTER2  0x4         /* number of words to transfer */
#define DST2      0x002fff00  /* dst address */
#define DST2_IDX  0x1         /* dst address increment */

DMAUNIF *dma = (DMAUNIF *)DMAADDR;
DMAUNIF autoini1;
DMAUNIF autoini2;

main() {
  /* initialize 1st set of autoinitialization values */
  autoini1.src     = (void *)SRC1;
  autoini1.src_idx = SRC1_IDX;
  autoini1.counter = COUNTER1;
  autoini1.dst     = (void *)DST1;
  autoini1.dst_idx = DST1_IDX;
  autoini1.linkp   = &autoini2;
  autoini1.ctrl    = (void *)CTRLREG1;

  /* initialize 2nd set of autoinitialization values */
  autoini2.src     = (void *)SRC2;
  autoini2.src_idx = SRC2_IDX;
  autoini2.counter = COUNTER2;
  autoini2.dst     = (void *)DST2;
  autoini2.dst_idx = DST2_IDX;
  autoini2.ctrl    = (void *)CTRLREG2;

  /* initialize DMA (link pointer pointing to 1st set of autoinit. values) */
  dma->linkp   = &autoini1;
  dma->counter = 0;
  dma->ctrl    = (volatile void *)CTRLREG1;

  /* wait for DMA to finish transfer */
  PRIM_WAIT_DMA((volatile int *)dma);
}
Example 7-8. Unified-Mode DMA Using Autoinitialization (Method 2)


/*******************************************************************************
EXAMPLE: Unified Mode
         Autoinitialization method 2:
         DMA0 in unified mode transfers 8 words from 0x02ffc00 (index 1)
         to 0x02ffd00 (index 1) and then it transfers 4 words from 0x02ffe00
         (index 4) to 0x02fff00 (index 1). No DMA sync transfer is used.
         Autoinitialization method 2 requires (N-1) autoinitialization memory
         blocks to transfer N blocks and starts with a DMA transfer counter
         different from 0.
*******************************************************************************/

#include "dma.h"

#define DMAADDR   0x001000a0

/* 1st transfer settings */
#define CTRLREG1  0x00c00009  /* DMA-CPU rotating priority and DMA
                                 autoinitializes when transfer counter = 0 */
#define SRC1      0x002ffc00  /* src address */
#define SRC1_IDX  0x1         /* src address increment */
#define COUNTER1  0x08        /* number of words to transfer */
#define DST1      0x002ffd00  /* dst address */
#define DST1_IDX  0x1         /* dst address increment */

/* 2nd transfer settings */
#define CTRLREG2  0x00c40005  /* DMA sends interrupt to CPU when transfer
                                 finishes (TC=1), DMA-CPU rotating priority
                                 and DMA stops after transfer completes */
#define SRC2      0x002ffe00  /* src address */
#define SRC2_IDX  0x4         /* src address increment */
#define COUNTER2  0x4         /* number of words to transfer */
#define DST2      0x002fff00  /* dst address */
#define DST2_IDX  0x1         /* dst address increment */

DMAUNIF *dma = (DMAUNIF *)DMAADDR;
DMAUNIF autoini2;

main() {
  /* initialize 2nd set of autoinitialization values */
  autoini2.src     = (void *)SRC2;
  autoini2.src_idx = SRC2_IDX;
  autoini2.counter = COUNTER2;
  autoini2.dst     = (void *)DST2;
  autoini2.dst_idx = DST2_IDX;
  autoini2.ctrl    = (void *)CTRLREG2;

  /* initialize DMA with 1st set of autoinitialization values */
  dma->src     = (void *)SRC1;
  dma->src_idx = SRC1_IDX;
  dma->counter = COUNTER1;
  dma->dst     = (void *)DST1;
  dma->dst_idx = DST1_IDX;
  dma->linkp   = &autoini2;
  dma->ctrl    = (void *)CTRLREG1;

  /* wait for DMA to finish transfer */
  PRIM_WAIT_DMA((volatile int *)dma);
}
Example 7-9. Split-Mode Auxiliary DMA Using Read Sync


/*******************************************************************************
EXAMPLE: Split-mode (AUX only)
         Commport-to-commport transfer:
         DMA3 auxiliary channel transfers 8 words from commport 3 to
         commport 0. DMA3 source sync with ICRDY3 is used.
         This example is functionally equivalent to Example 7-6.
         In this program, DMA3 expects data in commport 3 being sent by
         another processor/device. Otherwise no transfer will occur.
*******************************************************************************/

#include "dma.h"

#define DMAADDR   0x001000d0
#define CTRLREG   0x0309c091  /* DMA aux sends interrupt to CPU when
                                 transfer finishes (TC=1), DMA-CPU rotating
                                 priority */
#define DST       0x00100042  /* dst = commport 0 output fifo */
#define DST_IDX   0x0         /* dst address does not increment */
#define DIEVAL    0x4000      /* set ICRDY3 auxiliary read sync */
#define ACOUNTER  0x08        /* auxiliary channel counter */

DMASPLIT *dma = (DMASPLIT *)DMAADDR;
int dieval = DIEVAL;

main() {
  dma->dst      = (void *)DST;
  dma->dst_idx  = DST_IDX;
  dma->acounter = ACOUNTER;
  dma->ctrl     = (void *)CTRLREG;
  asm(" ldi @_dieval,die");
  AUX_WAIT_DMA((volatile int *)dma);
}
Example 7-10. Split-Mode Auxiliary and Primary Channel DMA


/*******************************************************************************
EXAMPLE: Split-mode (AUX and PRIMARY both running)
         Commport-to-commport transfer:
         DMA3 prim. channel sends 4 words from memory (0x02ffc00) to
         commport 3 (output FIFO).
         DMA3 aux. channel receives 8 words from commport 3 (input FIFO)
         to memory (0x02ffd00).
         DMA3 prim. channel uses OCRDY3 write sync.
         DMA3 aux. channel uses ICRDY3 read sync.
         In this program, DMA3 aux channel expects data in commport 3 being
         sent by another processor/device. Otherwise no aux channel transfer
         will occur.
*******************************************************************************/

#include "dma.h"
#define DMAADDR   0x001000d0
#define CTRLREG   0x03cdc0d5  /* DMA aux/prim send interrupt to CPU when
                                 transfer finishes (TC=1), DMA-CPU rotating
                                 priority, read/write sync transfer */
#define DIEVAL    0x24000     /* set ICRDY3/OCRDY3 read/write sync */
#define DST       0x02ffd00   /* auxiliary channel settings */
#define DST_IDX   0x1
#define ACOUNTER  0x08
#define SRC       0x02ffc00   /* primary channel settings */
#define SRC_IDX   0x1
#define COUNTER   0x04
DMASPLIT *dma = (DMASPLIT *)DMAADDR;
int dieval = DIEVAL;
main() {
  dma->src      = (void *)SRC;    /* primary channel */
  dma->src_idx  = SRC_IDX;
  dma->counter  = COUNTER;
  dma->dst      = (void *)DST;    /* auxiliary channel */
  dma->dst_idx  = DST_IDX;
  dma->acounter = ACOUNTER;
  dma->ctrl     = (void *)CTRLREG;

  asm(" ldi @_dieval,die");
  SPLIT_WAIT_DMA((volatile int *)dma);
}
Example 7-11. Split-Mode DMA Using Autoinitialization


/*******************************************************************************
EXAMPLE: Split-mode (AUX and PRIMARY both running)
         Autoinitialization example:
         DMA3 aux. channel autoinitializes and THEN receives 4 words from
         commport 3 (input FIFO) to memory (0x02ffd00).
         DMA3 prim. channel sends 4 words from memory (0x02ffc00) to
         commport 3 (output FIFO) and THEN other 2 words from memory
         (0x02ffc10) with index=2 to commport 3 (output FIFO).
         DMA3 prim. channel uses OCRDY3 write sync.
         DMA3 aux. channel uses ICRDY3 read sync.
         Autoinitialization method 1 is used in all cases.
         In this program, DMA3 aux channel expects data in commport 3 being
         sent by another processor/device. Otherwise no aux channel transfer
         will occur.
*******************************************************************************/

#include "dma.h"

#define DMAADDR    0x001000d0
#define CTRLREG1   0x03cdc0e9  /* DMA aux/prim send interrupt to CPU when
                                  transfer finishes (TC=1), DMA-CPU rotating
                                  priority, read/write sync transfer */
#define CTRLREG2   0x03cdc0d5  /* same as above, but DMA stops when the
                                  transfer finishes */
#define DIEVAL     0x24000     /* set ICRDY3/OCRDY3 read/write sync */

/* Primary channel */
#define SRC1       0x02ffc00   /* autoinitialization 1 */
#define SRC1_IDX   0x1
#define COUNTER1   0x04
#define SRC2       0x02ffc10   /* autoinitialization 2 */
#define SRC2_IDX   0x2
#define COUNTER2   0x02

/* Auxiliary channel */
#define DST1       0x02ffd00   /* autoinitialization 1 */
#define DST1_IDX   0x1
#define ACOUNTER1  0x04

DMASPLIT *dma = (DMASPLIT *)DMAADDR;
int dieval = DIEVAL;

DMAPRIM autoini1, autoini2;
DMAAUX  autoiniaux;

main() {
  /* PRIMARY CHANNEL : 1st autoinitialization values */
  autoini1.ctrl    = (void *)CTRLREG1;
  autoini1.src     = (void *)SRC1;
  autoini1.src_idx = SRC1_IDX;
  autoini1.counter = COUNTER1;
  autoini1.linkp   = &autoini2;

  /* PRIMARY CHANNEL : 2nd autoinitialization values */
  autoini2.ctrl    = (void *)CTRLREG2;
  autoini2.src     = (void *)SRC2;
  autoini2.src_idx = SRC2_IDX;
  autoini2.counter = COUNTER2;

  /* AUXILIARY CHANNEL : 1st autoinitialization values */
  autoiniaux.ctrl     = (void *)CTRLREG2;
  autoiniaux.dst      = (void *)DST1;
  autoiniaux.dst_idx  = DST1_IDX;
  autoiniaux.acounter = ACOUNTER1;

  /* initialize DMA */
  dma->linkp    = &autoini1;
  dma->alinkp   = &autoiniaux;
  dma->counter  = 0;
  dma->acounter = 0;
  dma->ctrl     = (void *)CTRLREG1;

  asm(" ldi @_dieval,die");

  /* wait for DMA to finish transfer */
  SPLIT_WAIT_DMA((volatile int *)dma);
}
Example 7-12. Include File for All C Examples (dma.h)


typedef struct dmaunif {
  volatile void  *ctrl;      /* control register        */
  volatile void  *src;       /* source address          */
  volatile int    src_idx;   /* source address index    */
  volatile int    counter;   /* transfer counter        */
  volatile void  *dst;       /* dest. address           */
  volatile int    dst_idx;   /* dest. address index     */
  struct dmaunif *linkp;     /* link pointer            */
} DMAUNIF;

typedef struct dmaprim {
  volatile void  *ctrl;      /* control register        */
  volatile void  *src;       /* prim. src address       */
  volatile int    src_idx;   /* prim. index             */
  volatile int    counter;   /* prim. transfer counter  */
  struct dmaprim *linkp;     /* link pointer            */
} DMAPRIM;

typedef struct dmaaux {
  volatile void  *ctrl;      /* control register        */
  volatile void  *dst;       /* aux. dst address        */
  volatile int    dst_idx;   /* aux. index              */
  volatile int    acounter;  /* aux. transfer counter   */
  struct dmaaux  *alinkp;    /* aux. link pointer       */
} DMAAUX;

typedef struct {
  volatile void  *ctrl;      /* control register        */
  volatile void  *src;       /* prim. src address       */
  volatile int    src_idx;   /* prim. index             */
  volatile int    counter;   /* prim. transfer counter  */
  volatile void  *dst;       /* aux. dst address        */
  volatile int    dst_idx;   /* aux. index              */
  struct dmaprim *linkp;     /* link pointer            */
  volatile int    acounter;  /* aux. transfer counter   */
  struct dmaaux  *alinkp;    /* aux. link pointer       */
} DMASPLIT;

#define PRIM_WAIT_DMA(x)   while ((0x00c00000 & *x) != 0x00800000)

#define AUX_WAIT_DMA(x)    while ((0x03000000 & *x) != 0x02000000)

#define SPLIT_WAIT_DMA(x)  while ((0x03c00000 & *x) != 0x02800000)




Chapter 8 


Using the Communication Ports 


The ’C4x communication ports are very high-speed data transmission circuits. 
Their speed and the close proximity of multiple data lines create special chal- 
lenges. General design rules that are applicable to high-speed (<10ns) 
memory interface design are appropriate for ’C4x communication-port inter- 
connections. This chapter provides guidelines for designing communication- 
port interfaces. 
8.1 Communication Ports


To provide simple processor-to-processor communication, the ’C4x has six
parallel bidirectional communication ports. Because these ports have port
arbitration units to handle the ownership of the communication-port data bus
between the processors, you should concentrate only on the internal operation
of the communication ports. For software, these communication ports can be
treated as 32-bit on-chip data I/O FIFO buffers. Reading data from or writing
data to a communication port is simple:


          LDI     @comm_port0_input,R0     ;Read data from comm. port 0
or
          STI     R0,@comm_port0_output    ;Write data to comm. port 0


If the CPU or DMA reads from or writes to the communication-port I/O FIFO 
and the I/O-FIFO is either empty (on a read) or full (on a write), the read/write 
execution will be extended either until the data is available in the input FIFO 
for a read, or until the space is available in the output FIFO for a write. Some- 
times, you can use this feature to synchronize the devices. However, this can 
slow down the processing speed and even hang up the processor. Avoid such 
situations by synchronizing the CPU/DMA accesses with the following flags 
that indicate the status of the port: 


ICRDY (input channel ready) 
= 0, the input channel is empty and not ready to be read. 
= 1, the input channel contains data and is ready to read. 


ICFULL (input channel full) 
= 0, the input channel is not full. 
= 1, the input channel is full. 


OCRDY (output channel ready) 
= 0, the output channel is full and not ready to be written. 
= 1, the output channel is not full and ready to be written. 


OCEMPTY (output channel empty) 
= 0, the output channel is not empty. 
= 1, the output channel is empty. 


Example 8-1 shows the reading of data from the communication port, eight
words at a time, using the CPU ICFULL interrupt. Example 8-2 shows the writing
of data to a communication port, one word at a time, using the polling method.
Both examples show DMA reads/writes. (DMA is discussed in subsection 7.3,
DMA Assembly Programming Examples, on page 7-4.)


Example 8-1. Read Data From Communication Port With CPU ICFULL Interrupt


*
*  TITLE READ DATA FROM COMMUNICATION PORT WITH CPU
*        ICFULL INTERRUPT
*
*  THIS EXAMPLE ASSUMES THE ICFULL 0 INTERRUPT VECTOR IS SET IN THE CPU
*  INTERRUPT VECTOR TABLE. THE EIGHT DATA WORDS ARE READ IN
*  WHENEVER THE DATA IS FULL IN COMM PORT 0 INPUT FIFO.
*
          LDA     @COMM_PORT0_CTL,AR2     ;Load comm port 0 control reg. address
          LDA     @COMM_PORT0_INPUT,AR0   ;Load comm port 0 input FIFO address
          LDA     @INTERNAL_RAM,AR1       ;Load internal RAM address
          AND3    0F7H,*AR2,R9            ;Unhalt comm port 0 input channel
          STI     R9,*AR2
          OR      04H,IIE                 ;Enable ICRDY 0 interrupt
          OR      02000H,ST               ;Enable CPU global interrupt
ICFULL0   PUSH    ST
          PUSH    RS
          PUSH    RE
          PUSH    RC
          LDI     *AR0,R10                ;Read data from comm port 0 input
          RPTS    6                       ;Set up for loop READ
READ      LDI     *AR0,R10                ;Read data from comm port 0 input
||        STI     R10,*AR1++(1)           ;Store data into internal RAM
          STI     R10,*AR1++(1)           ;Store data into internal RAM
          POP     RC
          POP     RE
          POP     RS
          POP     ST
          RETI
Example 8-2. Write Data to Communication Port With Polling Method


*
*  TITLE WRITE DATA TO COMMUNICATION PORT WITH POLLING METHOD
*
*  THE BIT 8 OF COMMUNICATION PORT 0 CONTROL REGISTER WILL BE
*  SET ONLY WHEN THE OUTPUT FIFO IS FULL. THIS EXAMPLE CHECKS
*  THIS BIT TO MAKE SURE THERE IS SPACE AVAILABLE IN
*  OUTPUT FIFO.
*
           LDA     @COMM_PORT0_CTL,AR2     ;Load comm port 0 control reg address
           LDA     @COMM_PORT0_OUTPUT,AR0  ;Load comm port 0 output FIFO address
           LDA     @INTERNAL_RAM,AR1       ;Load internal RAM address
           AND3    0EFH,*AR2,R9            ;Unhalt comm port 0 output channel
           STI     R9,*AR2
           LDI     0100H,R9                ;Load mask for bit 8

WAIT:      TSTB    *AR2,R9                 ;Check if output FIFO is full
           BZD     WAIT                    ;If yes, check again

WRITE_COMM LDI     *AR1++(1),R10           ;Read data from internal RAM
           STI     R10,*AR0                ;Store data into comm port 0 output
           NOP
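

The register tests used in Example 8-1 and Example 8-2 can also be written as small C predicates. This is a hedged sketch: the "bit 8 set when the output FIFO is full" test and the halt-bit positions (bit 3 for the input channel, bit 4 for the output channel) are inferred from the comments and the 0F7h/0EFh masks in those examples, the control-register value is passed in as a plain argument so the logic can be checked on a host, and all names are hypothetical.

```c
#include <stdint.h>

#define CP_OUT_FULL  0x0100u  /* bit 8 set: output FIFO full (per Example 8-2)  */
#define CP_OCH_HALT  0x0010u  /* bit 4: output-channel halt, inferred from 0EFh */
#define CP_ICH_HALT  0x0008u  /* bit 3: input-channel halt, inferred from 0F7h  */

/* Nonzero when a write to the output FIFO will not stall. */
int comm_output_has_room(uint32_t cpcr)
{
    return (cpcr & CP_OUT_FULL) == 0u;
}

/* Control-register value with the output channel unhalted. */
uint32_t comm_unhalt_output(uint32_t cpcr)
{
    return cpcr & ~(uint32_t)CP_OCH_HALT;
}

/* Control-register value with the input channel unhalted. */
uint32_t comm_unhalt_input(uint32_t cpcr)
{
    return cpcr & ~(uint32_t)CP_ICH_HALT;
}
```

Unlike the AND3 instructions in the assembly listings, these helpers clear only the halt bit and leave the other register bits unchanged; that is a design choice of the sketch, not something the examples specify.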




8.2 Signal Considerations 


Because of the bidirectional high-speed protocol used in the ’C4x communica-
tion ports, signal quality is extremely important. Poor quality signals can poten- 
tially cause both ends of a communication-port link to become a master. If this 
occurs and one communication port drives a signal request, no response is 
received from the other communication port, and the link hangs. This condition 
remains until both ’C4x devices are reset. If this is not corrected, the commu- 
nication-port drivers can be damaged. 


If poor quality signals are a problem, use circuits to improve impedance match- 
ing. Because the ’C4x communication-port output buffer impedance can 
change during signal switching, a conventional parallel termination does not 
help. Serial matching resistors can be added at each end of all communication 
port lines (see Figure 8-1). Serial resistors help match the output buffer im- 
pedance to the line impedance and protect against signal contention caused 
by any potential fault condition. The resistor value, plus buffer output imped- 
ance, should match the line impedance. Results have shown that a lower than 
optimal serial resistor value provides better performance. A resistor value of 
22-33 Ω is usually a reasonable start. Some experimentation may be needed
to reduce ringing effects. A good received signal should have an undershoot 
of 0.5 to 1.0 V or less. A resistor value that is too high results in an under- 
damped falling edge that does not cross the zero logic level and should be 
avoided. 
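

The matching rule above (resistor value plus buffer output impedance equals
line impedance) can be sketched numerically. The Python fragment below is an
illustration only: the 30-Ω buffer impedance and 75-Ω line are hypothetical
values, the function names are invented for this sketch, and a single buffer
impedance is a simplification because the real value changes during switching.
The reflection coefficient uses the standard transmission-line formula.

```python
# Illustrative series-termination arithmetic; all numbers are hypothetical.


def series_resistor(line_impedance_ohms, buffer_impedance_ohms):
    """Resistor that makes buffer + resistor equal the line impedance."""
    return max(line_impedance_ohms - buffer_impedance_ohms, 0.0)


def source_reflection_coefficient(source_impedance_ohms, line_impedance_ohms):
    """Reflection coefficient for a wave arriving back at the source end."""
    return (source_impedance_ohms - line_impedance_ohms) / (
        source_impedance_ohms + line_impedance_ohms
    )


buffer_ohms = 30.0   # assumed buffer output impedance
line_ohms = 75.0     # assumed line impedance

ideal_rs = series_resistor(line_ohms, buffer_ohms)
gamma_ideal = source_reflection_coefficient(buffer_ohms + ideal_rs, line_ohms)

# A lower-than-optimal resistor (33 ohms here) leaves the source slightly
# below the line impedance, which gives a small negative coefficient.
gamma_low = source_reflection_coefficient(buffer_ohms + 33.0, line_ohms)
```

The calculation only gives a starting value; as noted above, the final
resistor is chosen by observing ringing and undershoot on the received signal.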


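Figure 8-1. Impedance Matching for ’C4x Communication-Port Design

[Figure: a communication-port pin acting as an output drives a line of
impedance Z0 = 50-100 Ω into a pin acting as an input. Each end has a 10-kΩ
pullup resistor to VCC and a series resistor Rs of 22-33 Ω (lower than
optimum).]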


Even though pullup resistors do not help for impedance matching, they are 
recommended at each end to avoid unintended triggering after reset, when 
RESET going low is not received on all ’C4x devices at the same time. 


A pulldown resistor is not desirable, because it increases power consumption, 
does not protect the device from a fault condition, and can cause token loss 
and byte slippage on reset. 


For jumps to other boards or for long distances, a unidirectional data flow with
buffering is the preferred method. In this case, use buffers with hysteresis for 
CSTRB and CRDY at each end with delays greater than those in the data bus. 
This has two advantages: it cleans up the signals and helps eliminate glitches 
that can be erroneously perceived as valid control; it also allows the data bits 
to settle before the receiver sees CSTRB going low. 


8.3 Interfacing With a Non-’C4x Device 


To guarantee a correct word transfer operation between a ’C4x communica- 
tion port and a non-’C4x device, the non-’C4x device should mimic the hand- 
shaking operation between CSTRB and CRDY (word transfer), CREQ and 
CACK (token transfer). The token transfer operation is more complex than the 
word transfer operation. It requires tri-stating of pins after different events. 
Sections 8.6 and 8.7 offer examples on how to handle token transfers with 
non-’C4x devices. The word transfer operation is much simpler. The following 
sequence describes the word transfer operation: 


Word transfer operation 
CASE I: The non-’C4x has the token and transmits data. The ’C4x receives data. 


1) The non-’C4x device drives the first byte (byte 0) into the CD data lines and
then drops CSTRB low, indicating new data. There is no need to meet the 
maximum timing requirements, but the data should be valid before 
CSTRB goes low. 


2) The non-’C4x device waits for the ’C4x to respond with CRDY low and then
can immediately drive the next data byte and bring CSTRB high. 


3) The non-’C4x device waits for CRDY to be high; then, steps 1, 2, and 3 
repeat for bytes 1-3. 


4) After byte 3 is transmitted, the non-’C4x device can leave the byte 3 value 
in the CD lines until a new word is sent. 


5) In ’C4x device revisions lower than 3.0, CSTRB should go high no later than
one ’C4x H1/H3 cycle after CRDY low is received between word boundaries.
See Section 8.9, Implementing a CSTRB Shortener Circuit, for a circuit that
meets this requirement. In ’C4x device revisions 3.0 or higher, no CSTRB
width restriction exists.


6) The non-’C4x device can drive CSTRB low for the next word at any time 
after receiving CRDY high from the last byte. There is no reason to wait 
for the internal ’C4x synchronizer between CRDY low and CSTRB low for 
the next word to finish. 


CASE II: The ’C4x has the token and transmits data. The non-’C4x device
receives data.


1) After receiving CSTRB low from the ’C4x, indicating new data valid, the
non-’C4x device can immediately read the data byte and then drive CRDY
low, indicating that the byte has been read. There is no maximum time limit
between these two events.


2) The non-’C4x device then waits to receive CSTRB high and can immedi- 
ately drive CRDY high, ending the byte transfer operation. 
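

The CASE I sequence can be written out as an ordered list of bus actions. The
Python sketch below is a model of the handshake order only, not a driver: the
function names are hypothetical, no timing is modeled, and it assumes that
byte 0 is the least significant byte of the 32-bit word, which the text above
does not state.

```python
# Model of the word-transfer handshake seen from a non-'C4x transmitter.
# Assumption: byte 0 is the least significant byte of the word.


def split_word(word):
    """Split a 32-bit word into bytes 0 through 3."""
    return [(word >> (8 * index)) & 0xFF for index in range(4)]


def assemble_word(byte_values):
    """Rebuild a 32-bit word from bytes 0 through 3 (receiver side)."""
    word = 0
    for index, value in enumerate(byte_values):
        word |= (value & 0xFF) << (8 * index)
    return word


def transmit_word_events(word):
    """List the transmitter's actions for one word, in handshake order."""
    events = []
    for index, value in enumerate(split_word(word)):
        events.append(f"drive CD lines with byte {index} = {value:02X}h")
        events.append("drive CSTRB low")      # data must already be valid
        events.append("wait for CRDY low")    # 'C4x has taken the byte
        events.append("drive CSTRB high")
        events.append("wait for CRDY high")   # byte transfer complete
    return events


events = transmit_word_events(0x12345678)
```

The sketch drives each new byte only after CRDY has returned high. Step 2
permits driving the next byte as soon as CRDY goes low, so this ordering is a
conservative choice rather than a requirement.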


8.4 Terminating Unused Communication Ports


To avoid unintended communication port triggering, you can terminate unused 
communication-port control lines in one of the following ways: 


- Use pullup resistors on all the communication-port control lines. Pullups on
  data lines of input communication ports are optional, but they lower power
  consumption. Pullups on data lines of output communication ports are not
  required; if used, they increase power consumption.


- Tie the control lines together on the same communication port, that is,
  CSTRB to CRDY and CREQ to CACK. This holds the control inputs high
  without using external pullup resistors.


8.5 Design Tips 



Be careful with different voltage levels when running multiple ’C4x devices
(or any other CMOS devices) from different power supplies. This can create
a CMOS latch-up that can permanently damage your device. Adding serial
resistors to ’C4x communication ports that connect devices on different
boards helps marginally to protect the communication-port drivers. It is
recommended that all ’C4x devices in the system remain in reset until the
power supplies are stable.


Sometimes, it is beneficial to keep the line impedance as high as possible.
This helps when interfacing to external cables. Typical ribbon cable
impedance is about 100 Ω.


Because it is sometimes difficult to route high-impedance lines (especially 
long ones) in a circuit board, use an external ribbon cable to jump over the 
length of a board. In this case, only two headers should be installed in the 
circuit board. 


Use an alternating signal and ground scheme. This helps control differen- 
tial signal coupling and impedance variation. For quality signals, use a 
26-wire ribbon ((4 control + 8 data + 1 shield) * 2 = 26). The shield is need- 
ed for the signal that is otherwise on the edge. 


Do not route signals on top of each other. When it is necessary to cross 
traces on adjacent layers, cross them at right angles to reduce coupling. 


Note:

Because the ’C4x communication ports are very high-speed data transmission
circuits, signal quality is very important. A poor quality signal can cause
a byte to be missed or slipped. If this happens, the only solution is a ’C4x
reset. Because at reset communication ports 0, 1, and 2 are transmitters and
3, 4, and 5 are receivers, a safe reset requires resetting every ’C4x
connected to the ’C4x with the faulty condition. A global reset becomes a
necessity.


8.6 Commport to Host Interface


A host interface between a ’C4x commport and a PC’s bidirectional printer port
has many advantages, including freeing up the DSP bus and treating the host
PC as a virtual ’C4x node within a system of ’C4x devices.


This interface uses a bidirectional PC printer port. Logic circuits, buffers,
and resistors convert logic control levels driven from the printer port into
’C4x commport control signals. Signals driven from the ’C4x are converted into
status signals, which can be polled in software by the PC. In addition, the PC’s
printer port provides the byte-wide data path into and out of the PC.


You can use this I/O interface for host-data communication, bootloading, and
debug operations. With proper buffering and software control, it is also
possible to build long and reliable links. The speed is primarily dependent on
the speed of the host. When using a PC as the host, the speed is limited by the
PC’s I/O channel speed. If higher rates are needed, use a memory-mapped
version of the printer port in the PC.


The printer port used to test this circuit was the DSP-550 from STB Systems,
but there are other bidirectional printer ports on the market. Using the STB card
in the bidirectional mode requires that a jumper be set (see your manual).
Then, if a 1 is written to bit 5 or bit 7 of the control register (this depends
on your printer port), data can be read back from the data register.


8.6.1 Simplified Hardware Interface for ’C40 PG ≥ 3.3 or ’C44 Devices


Figure 8-2 shows a simplified commport signal splitter that splits each
commport control signal into a simple drive and sense pair of signals.
Simplified, in this case, means that, although the circuit is easy to follow
functionally and will operate, it is not the preferred solution (see the
improved driver in Figure 8-3). The signals in this circuit can be easily
buffered without risk of driver conflicts. However, keep a few things in mind
about the simplified design:


- Due to commport control-signal restrictions in earlier silicon revisions, this
  circuit will not work with the TMS320C40 PG 3.0 or lower.


- This circuit requires a bidirectional printer port.


- Standard printer-port cables often do not provide ‘clean’ signals.


- A high value is needed for the isolation resistor in order to keep the current
  levels during signal opposition to a minimum. However, a low value is needed
  for the isolation resistor in order to ensure reasonably fast rise and fall
  times of the commport control signals when they are inputs. This conflict can
  be overcome by carefully picking the resistor values or by adding
  additional biasing.


Figure 8-2. Better Commport Signal Splitter

[Figure: each commport control signal (CREQ, CACK, CSTRB, CRDY) is split into
a drive signal (_drv) and a sense signal (_sns). The printer-port outputs
SLCTIN, INIT, AUTOFD, and STROBE feed CREQ_drv, CACK_drv, CSTRB_drv, and
CRDY_drv; the printer-port status inputs BUSY, ACK, PE, and SLCT read
CREQ_sns, CACK_sns, CSTRB_sns, and CRDY_sns. Data lines D0-D7 connect the
printer port to the commport. An LS32 gate and the RESET signal also appear
in the circuit.]

Legend: Rp = 470 ohms    Rx = 180 ohms
        Rs = 47 ohms     Ry = 220 ohms


8.6.2 Improved Drive and Sense Amplifiers


Two improvements are suggested for the interface described above. The
improvements are shown in Figure 8-3.


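Figure 8-3. Improved Interface Circuit

[Figure: the parallel port drives a ’C4x commport control line (Cxxx) through
an RS-232 driver, and a sense path returns the line state to the parallel
port. Labeled elements include the pullup Rp to VCC, resistors R1, R2, and R3,
and capacitor C1.]

Legend: Rp = 470 ohms    R2 = 10 kohms    C1 = 100 pF
        R1 = 1 kohm      R3 = 50 ohms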


The first improvement is that the signals going to and from the printer port are
synchronized using a clock and a simple data latch. Because samples are taken
at discrete times, noise that corrupts the first sample of a transition will
probably not corrupt the next sample as well. A hysteresis loop made from
resistors R1 and R2 improves the noise immunity further. Capacitor C1 is an
additional analog filter that rejects high-frequency noise.


The next major improvement is the use of a current driver in place of the
isolation resistor. In this case, an RS-232 driver is used; this driver can
drive beyond the supply rails of the DSP and has a built-in current limit of
about 20 mA. Diodes D1 and D2, along with R3, clamp the resulting signal to the
supply rails of the DSP and latch to prevent excessive overdrive. The DSP and
latch both have internal clamping diodes, but relying on them is not
recommended because they are not intended for this purpose.



8.6.3 How the Circuit Works 


The PC can drive any value on the control lines, independent of the returned
status. If a logic 1 is driven into the drive side of the isolation resistor
and a logic 0 is observed on the sense side, the ’C4x commport signal in
question is without a doubt an output.


By then driving levels and polling the returned status, it is possible to synchro- 
nize a host processor to the state machine of the ’C4x commport. The advan- 
tage of this design is that it can be easily ported to any smart processor with 
any basic I/O capability. For example, TMS320C31/32 devices have been 
used as slave devices that are bootloaded from a commport and then used as 
serial ports with internal memory and additional processing capabilities. Com- 
plicated and risky ASIC designs are not required and the solution is fully pro- 
grammable. 
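

The drive-and-sense test described above can be expressed as a short routine.
The Python sketch below runs against a simulated pin rather than real
printer-port registers; the class and function names are hypothetical. The
text states only the case of driving a 1 and sensing a 0; treating the
opposite disagreement (driving a 0 and sensing a 1) the same way is an
assumption of this sketch.

```python
# Direction detection through an isolation resistor, on a simulated pin.
# SimulatedPin and commport_pin_is_output are illustrative names only.


class SimulatedPin:
    """Stand-in for one commport control line behind an isolation resistor."""

    def __init__(self, output_level=None):
        # output_level is 0 or 1 when the 'C4x drives the line,
        # or None when the 'C4x pin is an input.
        self.output_level = output_level
        self.driven_level = 1

    def drive(self, level):
        """Set the level on the drive side of the isolation resistor."""
        self.driven_level = level

    def sense(self):
        """Return the level on the sense side of the isolation resistor."""
        if self.output_level is None:
            return self.driven_level      # an input follows the host
        return self.output_level          # an output overrides the host


def commport_pin_is_output(pin):
    """Drive both levels; any disagreement means the 'C4x is driving."""
    for level in (1, 0):
        pin.drive(level)
        if pin.sense() != level:
            return True
    return False


driving_low = commport_pin_is_output(SimulatedPin(output_level=0))
driving_high = commport_pin_is_output(SimulatedPin(output_level=1))
not_driving = commport_pin_is_output(SimulatedPin())
```

On real hardware, `drive` and `sense` would be replaced by writes to the
printer-port control register and reads of its status register.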


You must include current limiting circuitry when designing any ’C4x
interface. If the current is not limited, it can exceed 100 mA per
pin, which can damage a device.


8.6.4 The Interface Software 


The interface software for this host interface is available through the TI BBS
(filename: M4x_2.exe). This file contains not only the low-level software
drivers, but also extra code for the M4x (a multiprocessor ’C4x communication
kernel) applications note. The following files are contained in this application:


- M4X          Debugger (no source code)
- MEMVIEW      memory and communications matrix view and edit utility
- MANDEL40     multiprocessor Mandelbrot demonstration program
- M4X.ASM      multiprocessor TMS320C4x communications kernel
- DRIVER.CPP   higher level system functions
- TARGET.CPP   getmem, putmem, run, stop and singlestep commands
- OBJECT.CPP   source code for using the printer port interface


8.7 An I/O Coprocessor-’C4x Interface


This section presents a software-based interface that provides a ’C4x with a
flexible bidirectional interface to a TMS320C32. The ’C32 acts as a smart I/O
coprocessor that can provide AIC interfacing and data preprocessing, among
other functions. The ’C32 is an inexpensive and flexible solution.


Some of the advantages of using an I/O coprocessor include:

- An I/O coprocessor can help with data processing.


- An I/O coprocessor allows for error correction and recovery from ’C4x
  commport interface problems.


- An I/O coprocessor can buffer data, allowing faster ’C4x data throughput.


Figure 8-4 shows the ’C32-to-’C4x interface. Through the interface, a ’C4x
commport is memory-mapped to the ’C32 external memory bus. The interface
uses four ’C32 I/O pins to drive the commport control signals.


Figure 8-4. A ’C32 to ’C4x Interface

[Figure: the ’C4x output commport signals CREQ, CACK, CRDY, and CSTRB and the
data lines D0-D7 connect to the ’C32, with pullup resistors to VCC.]

Pullup resistors on the XF0, XF1, TCLK0, and TCLK1 lines are used to prevent
undesired glitches due to temporary high-impedance conditions. Serial
resistors are also used on the same pins for better impedance matching.


The interface software drivers and a more detailed explanation of the interface 
can be obtained from our TI BBS (filename 4xaic.exe). Token transfer and 
word transfer drivers are included with the software. 



8.8 Implementing a Token Forcer 


After system reset, half of the communication channels associated with a par- 
ticular ’C4x have token ownership (communication ports 0, 1, 2), and the other 
half (communication ports 3, 4, 5) do not. 


If, because of system configuration requirements, communication port
direction must be changed, the circuits shown in Figure 8-5 and Figure 8-6 can
be used. The circuits force the token to be passed and the communication port
direction to remain changed.


Even though these circuits are intended to force a change of the original com- 
munication port direction after reset, they can be used also to maintain the orig- 
inal direction. However, this can be more conveniently achieved using pullups 
in CACK and CREQ. The pullups prevent any damage to the communication 
ports in the event of a program error that writes into a port configured as an 
input. 


Forcing a communication port to become an output port 


Figure 8—5 shows a circuit that forces a communication port to become an out- 
put port. In this circuit, driving the CACK line with the CREQ line reconfigures 
an input port as an output port. When a word is written to the FIFO, CREQ is 
driven low, indicating a token request. After a synchronizer delay of 1 to 2 
cycles (U1 and U2), CACK is driven low, indicating a token acknowledge. 
CREQ then goes active high and then is held high by Rp as the line switches 
to an input. The CLK signal can be any clock with a frequency equal to or lower 
than the H1/H3 clock. 


The synchronizer delay is important. If no delay is provided, the CREQ line will 
not be ready to change to an input high condition. As a result, the CACK line, 
which, at this point, is a delayed version of CREQ, is inverted and applied to 
the CREQ line. This results in an oscillation until the synchronizer period has 
timed out. 
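

The effect of the synchronizer delay can be illustrated with a two-stage
shift-register model. The Python sketch below assumes that U1 and U2 behave as
cascaded D flip-flops clocked by CLK, which the figure suggests but the text
does not state; the function name is hypothetical, and the sketch models only
the delay, not the pin direction changes.

```python
# Two-stage synchronizer model: the output lags the sampled input.
# Assumption: U1 and U2 act as cascaded D flip-flops on a common clock.


def two_stage_synchronizer(samples, initial=1):
    """Return the second flip-flop's output after each clock edge.

    samples holds the input level (CREQ) seen at successive clock edges.
    """
    q1 = q2 = initial
    outputs = []
    for level in samples:
        # On each edge the second stage takes the first stage's old value,
        # and the first stage takes the new input.
        q1, q2 = level, q1
        outputs.append(q2)
    return outputs


creq_samples = [1, 0, 0, 0, 1, 1, 1]
cack_levels = two_stage_synchronizer(creq_samples)
```

In this model a CREQ change sampled on one clock edge reaches CACK on the
following edge. Because CREQ can change at any point before the edge that
samples it, the delay from the change itself falls between one and two clock
periods, consistent with the 1 to 2 cycles quoted above.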


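Figure 8-5. A Token Forcer Circuit (Output)

[Figure: CREQ is pulled up to VCC through Rp (10 kΩ) and feeds a two-stage
synchronizer (U1 and U2) clocked by CLK; the synchronizer output drives CACK
through Rs (470 Ω).]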


Forcing a communication port to become an input port


Figure 8-6 shows a circuit that forces a communication port to become an
input port. In this circuit, driving the CREQ line with an inverted CACK
reconfigures an output port as an input port. If CREQ is an input, it is held
low through Rs whenever CACK is high or floating high because of Rp. The port
then responds to this request by driving CACK low, which, in turn, drives CREQ
high, finishing the token acknowledge. As in Figure 8-5, synchronizer delays
mimic the response of another ’C4x communication port to prevent oscillation.


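Figure 8-6. Communication-Port Driver Circuit (Input)

[Figure: CACK is pulled up to VCC through Rp (10 kΩ) and feeds a two-stage
synchronizer clocked by CLK; the synchronizer output passes through an
inverter and drives CREQ through Rs (470 Ω).]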


Note that after the port has been reconfigured as an input port, the CREQ line 
is active high while the output of the inverter is low. This causes a constant cur- 
rent flow from CREQ to the inverter. 


8.9 Implementing a CSTRB Shortener Circuit


In ’C40 device revisions lower than 3.0, the width of the CSTRB low pulse
between word boundaries should not exceed 1.0 H1/H3 cycle at the receiving
end. A CSTRB low beyond the synchronization period on a word boundary can be
recognized as a new valid CSTRB, resulting in an extra byte reception (byte
slippage). For a short distance between two communicating ’C4x devices,
byte slippage is not a problem. In ’C40 device revisions 3.0 or higher, or in
any revision of the ’C44, no CSTRB width restriction exists.


The circuit shown in Figure 8-7 can reduce the width of CSTRB for very long
distances when you are using ’C4x device revisions lower than 3.0. The circuit
has buffers for CSTRB and CRDY on the transmitting end and two S-R flip-flops
on the receiving end. On the receiving end, a low STRB incoming signal
causes the Q signal of S-R flip-flop U1 to go low, forcing the CSTRB pin to go
low. When CRDY responds with a low signal, S-R flip-flop U2 drives the RDY
signal low. Because RDY is also tied to the S input of U1, and S has
precedence over R in an S-R flip-flop, Q in U1 goes high. Also, STRB is
inverted and drives the S input of U2. In this way, the width of the local
CSTRB is shortened, regardless of the channel length. When the STRB signal goes
back high, the S-R flip-flop pair is ready to receive another CSTRB.


Figure 8—7. CSTRB Shortener Circuit 


8.10 Parallel Processing Through Communication Ports


The ’C4x communication ports are key to parallel processing design flexibility. 
Many processors can be linked together in a wide variety of network configura- 
tions. In this section, Figure 8-8 illustrates 'C4x parallel processing connectiv- 
ity networks that are used to fulfill many signal processing system needs. 


Figure 8-8. ’C4x Parallel Connectivity Networks

[Figure: network topologies built from ’C4x devices joined by
communication-port connections. Panels:

- Pipelined Linear Array: for convolution and correlation and other pipelined
  operations in graphics and modem applications.
- 2-D Array: excellent for image processing.
- Tree Structures: for hierarchical processing such as speech and image
  recognition applications.
- Bidirectional Ring
- 3-D Grid
- Hexagonal Grid
- 4-D Hypercube
- Fully-Connected Network: a more general-purpose structure.

One further panel description survives only in part: "... port for more I/O.
Very effective for neural networks."]

According to memory interface, ’C4x parallel system architecture can be
classified in three basic groups:

- Shared-Memory Architecture: shares global memory among processors.


- Distributed-Memory Architecture: each processor has its own private local
  memory. Interprocessor communication is via ’C4x communication ports.


- Shared- and Distributed-Memory Architecture: each processor has its
  own local memory but also shares a global memory with other processors.


Figure 8-8 shows examples of these basic groups. 


8.11 Broadcasting Messages From One ’C4x to Many ’C4x Devices


Message broadcasting from one ’C4x to many ’C4x devices requires a simple
interface. However, try to avoid analog signal delays caused by distance
differences between the ’C4x master and the ’C4x slave processors. These delays
could create bus contention on the CSTRB and CRDY lines. Figure 8-9 shows
the block diagram of a multiple processor system. In this design, one ’C4x is
the dedicated transmitter, and three ’C4x devices are dedicated receivers. No
reset circuitry is needed, because the transmitter is communication port 0, and
the receivers are communication ports 3, 4, and 5. At reset, ’C4x
communication ports 0, 1, and 2 are output ports, and communication ports 3, 4,
and 5 are input ports.


Because the communications configuration is fixed, no token transfer is
needed; this allows the CREQ and CACK pins of all processors to be individually
pulled up to 5 volts through 22-kΩ resistors.


In all cases, each CSTRB should be individually buffered to ensure that line
reflections do not corrupt each received CSTRB signal. The data pins CD7-0
of intercommunicating ’C4x devices can be tied together. In general, for fewer
than three receivers and distances shorter than six inches, data skew relative
to CSTRB is not a problem, and data buffering is not needed. However, if more
than three receivers must be driven by a single transmitter or the distance is
more than six inches, both the CSTRB and CD7-0 lines must be buffered.


The CRDY signal input of the transmitter is generated by ORing the RDY outputs
of all of the receiver communication ports. The transmitter should not receive
a RDY signal until all of the receivers have received the data.
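

Because the ready signal is active low, ORing the receiver outputs gives the
transmitter a low level only after every receiver has driven its own output
low. The Python sketch below shows that behavior with logic levels only; the
function name is hypothetical and no gate delays are modeled.

```python
# Combine active-low RDY outputs from several receivers with a logical OR.
# Levels: 1 = high (not ready), 0 = low (byte has been read).


def transmitter_crdy(receiver_rdy_levels):
    """Return the level presented to the transmitter's CRDY input."""
    return 1 if any(receiver_rdy_levels) else 0


one_receiver_pending = transmitter_crdy([0, 1, 0])
all_receivers_done = transmitter_crdy([0, 0, 0])
```

With three receivers, the transmitter therefore sees CRDY low only when the
slowest receiver has responded.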


In addition, to ensure that the dedicated receiver ’C4x devices do not try to
arbitrate for the communication-port bus, you should halt the output ports of
the receiver ’C4x devices by setting bit four of their communication-port
control registers to one.



Figure 8—9. Message Broadcasting by One 'C4x to Many ’C4x Devices 


’C4x Power Dissipation


The power-supply current requirements ($I_{DD}$) of the ’C4x vary with the
specific application and the device program activity. The maximum power
dissipation of a device can be calculated by multiplying $I_{DD}$ by $V_{DD}$
(the power-supply voltage requirement). Both parameters are provided in the
’C4x data sheet. Additionally, due to the inherent characteristics of CMOS
technology, the current requirements depend on clock rates, output loadings,
and data patterns.


This chapter presents the information you need to determine power-supply 
current requirements for the ’C4x under various operating conditions. After 
you make this determination, you can then calculate the device power dissipa- 
tion, and, in turn, thermal management requirements. 


9.1 Capacitive and Resistive Loading


In CMOS devices, the internal gates swing completely from one supply rail to 
the other. The voltage change on the gate capacitance requires a charge 
transfer, and therefore causes power consumption. 


The required charge for a gate’s capacitance is calculated by the following
equation:

$$Q_{gate} = V_{DD} \times C_{gate} \quad \text{(coulombs)}$$

where:

$Q_{gate}$ is the gate’s charge,
$V_{DD}$ is the supply voltage, and
$C_{gate}$ is the gate’s capacitance.

Since current is coulombs per second, the current can then be obtained from:

$$I = \text{coul/s} = V_{DD} \times C_{gate} \times \text{Frequency}$$

where:

$I$ is the current.


For example, the current consumed by an 80-pF capacitor being driven by a
10-MHz CMOS level square wave is calculated as follows:

$$I = 5\ \text{(volts)} \times 80 \times 10^{-12}\ \text{(farads)} \times 10 \times 10^{6}\ \text{(charges/s)} = 4\ \text{mA at 10 MHz}$$
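

The same arithmetic can be checked in a few lines of Python. The function name
below is illustrative; the input values are the ones used in the example above.

```python
# Switching current of a capacitive load: I = V x C x f.


def switching_current(vdd_volts, capacitance_farads, frequency_hz):
    """Return the average current in amperes drawn by the switched load."""
    return vdd_volts * capacitance_farads * frequency_hz


# 5 V, 80 pF, 10 MHz, as in the example above.
current_amps = switching_current(5.0, 80e-12, 10e6)
current_milliamps = current_amps * 1000.0
```

Doubling either the load capacitance or the switching frequency doubles the
current, which is why output loading and clock rate appear throughout the
calculations in this chapter.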


Furthermore, if the total number of gates in a device is known, the effective
total capacitance can be used to calculate the current for any voltage and
frequency. For a given CMOS device, the total number of gates is probably not
known, but you can solve for a current at a particular frequency and supply
voltage and later use this current to calculate for any supply voltage and
operating frequency.

$$I_{device} = V_{DD} \times C_{total} \times f_{CLK}$$

where:

$I_{device}$ is the current consumed by the device,
$C_{total}$ is the total capacitance, and
$f_{CLK}$ is the clock frequency.


Solving for power ($P = V \times I$), the equation becomes:

$$P_{device} = V_{DD}^{2} \times C_{total} \times f_{CLK}$$

where:

$P_{device}$ is the power consumed by the device.


In this case, $C_{total}$ includes both internal and external capacitances.
$C_{total}$ can be effectively reduced by minimizing power-consuming internal
operation and external bus cycles. Bipolar devices, pullup resistors, and other
devices consume DC power that adds a constant offset unaffected by $f_{CLK}$.
The effect of these DC losses depends on data, not frequency. This document
assumes an all-CMOS approach in which these effects are minimal.
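

The scaling procedure described above (solve for an effective capacitance from
one known operating point, then reuse it) can be sketched as follows. The
operating point of 350 mA at 5 V and 50 MHz is an illustrative input, the
function names are hypothetical, and the purely capacitive model ignores any
DC offset, so the results are rough estimates only.

```python
# Effective-capacitance scaling for a CMOS device (capacitive model only).


def effective_capacitance(current_amps, vdd_volts, frequency_hz):
    """Solve I = V x C x f for C at a known operating point."""
    return current_amps / (vdd_volts * frequency_hz)


def device_current(vdd_volts, total_capacitance_farads, frequency_hz):
    """I_device = V_DD x C_total x f_CLK."""
    return vdd_volts * total_capacitance_farads * frequency_hz


def device_power(vdd_volts, total_capacitance_farads, frequency_hz):
    """P_device = V_DD^2 x C_total x f_CLK."""
    return vdd_volts ** 2 * total_capacitance_farads * frequency_hz


# Illustrative operating point: 350 mA at 5 V and 50 MHz.
c_total = effective_capacitance(0.350, 5.0, 50e6)

# Estimates at other conditions using the same effective capacitance.
current_at_40mhz = device_current(5.0, c_total, 40e6)
power_at_50mhz = device_power(5.0, c_total, 50e6)
```

Because power depends on the square of the supply voltage, a reduction in
$V_{DD}$ lowers power more than the same relative reduction in clock frequency.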


Another source of power consumption is the current consumed by a CMOS 
gate when it is biased in the linear region. Typically, if a gate is allowed to float, 
it can consume current. Pullups and pulldowns of unused pins are therefore 
recommended. 


9.2 Basic Current Consumption


Generally, power supply current requirements are related to the system (for
example, operating frequency, supply voltage, temperature, and output load).
In addition, because the current requirement for a CMOS device depends on
the charging and discharging of node capacitance, factors such as clocking
rate, output load capacitance, and data values can be important.


9.2.1 Current Components


The power supply current has four basic components:

- Quiescent
- Internal operations
- Internal bus operations
- External bus operations


9.2.2 Current Dependency


The power supply current consumption depends on many factors. Four are
system related:

- Operating frequency
- Supply voltage
- Operating temperature
- Output load


Several others are related to TMS320C4x operation:

- Duty cycle of operations
- Number of buses used
- Wait states
- Cache usage
- Data value


You can calculate the total power supply current requirement for a ’C4x device
by using the equation below, which comprises the four basic power supply
current components and three system-related dependencies described above.

$$I_{total} = (I_{q} + I_{iops} + I_{ibus} + I_{xbus}) \times F \times V \times T$$

where:

$I_{total}$ is the total supply current,

$I_{q}$ is the quiescent current component,

$I_{iops}$ is the current component due to internal operations,

$I_{ibus}$ is the current component due to internal bus usage, including data
value and cycle time dependency,

$I_{xbus}$ is the current component due to external bus usage, including data
value, wait state, cycle time, and capacitive load dependency,

$F$ is a scale factor for frequency,
$V$ is a scale factor for supply voltage, and
$T$ is a scale factor for operating temperature.


This report describes in detail the application of this equation and determina- 
tion of all the dependencies. The power dissipation measurements in this re- 
port were taken using a ’C40 PG 3.X running at speeds up to 50 MHz and at 
a voltage level of 5 V. 


The minimum power supply current requirement is 130 mA. The typical current
consumption for most algorithms is 350 mA, as described in the TMS320C4x
data sheet, unless excessive data output is being performed.


The maximum current requirement for a ’C4x running at 50 MHz is 850 mA and
occurs only under worst case conditions: writing alternating data
(AAAA AAAAh to 5555 5555h) out of both external buses simultaneously, every
cycle, with 80 pF loads.


9.2.3 Algorithm Partitioning 


Each part of an algorithm has its own pattern with respect to internal and exter- 
nal bus usage. To analyze the power supply current requirement, you must 
partition an algorithm into segments with distinct concentrations of internal or 
external bus usage. Analyze each program segment to determine its power- 
supply current requirement. You can then calculate the average power supply 
current requirement from the requirements of each segment of the algorithm. 
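

One way to combine the per-segment results is a time-weighted average, which
the Python sketch below illustrates. Weighting each segment by its cycle count
is an assumption of this sketch, as are the function name and the segment
values; the text above does not prescribe a particular averaging method.

```python
# Time-weighted average supply current over the segments of an algorithm.


def average_supply_current(segments):
    """Return the average current in mA.

    segments is a sequence of (current_mA, cycles) pairs, one per program
    segment. Each segment is weighted by the cycles it takes to execute.
    """
    segments = list(segments)
    total_cycles = sum(cycles for _, cycles in segments)
    if total_cycles <= 0:
        raise ValueError("total cycle count must be positive")
    weighted = sum(current * cycles for current, cycles in segments)
    return weighted / total_cycles


# Hypothetical algorithm: 300 cycles at 190 mA, then 100 cycles at 350 mA.
average_ma = average_supply_current([(190.0, 300), (350.0, 100)])
```

Cycle counts for each segment can come from the benchmarking features of the
simulator or emulator.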


9.2.4 Test Setup Description 


All TMS320C4x supply current measurements were performed on the test 
setup shown in Figure 9-1. The test setup consists of a TMS320C40, capaci- 
tive loads on all data and address lines, but no resistive loads. A Tektronix digi- 
tal multimeter measures the power supply current. Unless otherwise specified, 
all measurements are made at a supply voltage of 5 V, an input clock frequency 
of 50 MHz, a capacitive load of 80 pF, and an operating temperature of 25°C. 
Note that the current consumed by the oscillator and pullup resistors does not 
flow through the current meter. This current is considered part of the system’s 
resistive loss (see section 9.1, Capacitive and Resistive Loading). 

Figure 9-1. Test Setup 


9.3 Current Requirement of Internal Components


The power-supply current requirement for internal circuitry consists of three
components: quiescent, internal operations, and internal bus operations.
Quiescent and internal operations are constants, whereas the internal bus
operations component varies with the rate of internal bus usage and the data
values being transferred.


9.3.1 Quiescent


The quiescent requirement for the TMS320C4x is 130 mA while in IDLE.
Quiescent refers to the baseline supply current drawn by the TMS320C4x
during minimal internal activity. Examples of quiescent current include:

- Maintaining timer and oscillator
- Executing the IDLE instruction
- Holding the TMS320C4x in reset


9.3.2 Internal Operations


Internal operations include register-to-register multiplication, ALU operations, 
and branches, but not external bus usage or significant internal bus usage. In- 
ternal operations add a constant 60 mA above the quiescent requirement, so 
that the total contribution of quiescent and internal operation is 190 mA. Note, 
however, that internal and/or external program operations executed via an 
RPTS instruction do not contribute an internal operations power supply current 
component. During an RPTS instruction, program fetch activity other than the 
instruction being repeated is suspended; therefore, power-supply current is 
related only to the data operations performed by the instruction being 
executed. 
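

The figures in this section can be combined using the total-current equation
from Section 9.2.2. The Python sketch below is an illustration under stated
assumptions: the function name is hypothetical, and the scale factors F, V,
and T are taken as 1.0, which is assumed here to correspond to the conditions
under which the 130-mA and 60-mA figures were measured.

```python
# Total supply current from its four components and three scale factors:
# I_total = (I_q + I_iops + I_ibus + I_xbus) x F x V x T, currents in mA.


def total_supply_current(i_q, i_iops, i_ibus, i_xbus,
                         f_scale=1.0, v_scale=1.0, t_scale=1.0):
    """Return the total supply current in the same unit as the inputs."""
    return (i_q + i_iops + i_ibus + i_xbus) * f_scale * v_scale * t_scale


# Quiescent plus internal operations, with no significant bus usage.
internal_only_ma = total_supply_current(130.0, 60.0, 0.0, 0.0)

# Code running under RPTS contributes no internal-operations component,
# so only the quiescent and bus terms remain (bus terms are zero here).
rpts_no_bus_ma = total_supply_current(130.0, 0.0, 0.0, 0.0)
```

The bus terms are left at zero here; realistic values for them depend on the
transfer rate and data pattern.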


Figure 9-2. Internal and Quiescent Current Components 


Internal Bus Operations


The internal bus operations include all operations that utilize the internal buses 
extensively, such as internal RAM accesses every cycle. No distinction is 
made between internal reads or writes, such as instruction or operand fetches 
from internal memory, because internally they are equal. Significant use of 
internal buses adds a data-dependent term to the equation for the power sup- 
ply current requirement. Recall that switching requires more current. Hence, 
changing data at high rates requires higher power-supply current. 


Pipeline conflicts, use of cache, fetches from external wait-state memory, and 
writes to external wait-state memory all affect the internal and external bus 
cycles of an algorithm executing on the TMS320C4x. Therefore, you must 
determine the algorithm’s internal bus usage in order to accurately calculate 
power supply current requirements. The TMS320C4x software simulator and 
XDS emulator both provide benchmarking and timing capabilities that help you 
determine bus usage. 




Figure 9-3. Internal Bus Current Versus Transfer Rate 
The current resulting from internal bus usage varies linearly with transfer rates.
Figure 9-3 shows internal bus-current requirements for transferring alternating
data (AAAA AAAAh to 5555 5555h) at several frequencies. Note that transfer
rates greater than the TMS320C4x’s MIPS rating are possible because of
internal parallelism.


The data set AAAA AAAAh to 5555 5555h exhibits the maximum internal bus 
current for data transfer operations. The current required for transferring other 
data patterns may be derated accordingly, as described later in this subsec- 
tion. 


As the transfer rate decreases (that is, transfer-cycle time increases), the
incremental IDD approaches 0 mA. This figure represents the incremental IDD due
to internal bus operations and is added to the quiescent and internal operations
current values.


For example, the maximum transfer rate corresponds to three accesses every 
cycle (one program fetch and two data transfers) or an effective one-third H1 
transfer cycle time. At this rate, 178 mA is added to the quiescent (130 mA) 
and internal operation (60 mA) current values for a total of 368 mA. 
Figure 9-3 shows the internal bus current requirement when transferring As
followed by 5s for various transfer rates. Figure 9-4 shows the data dependence
of the internal bus-current requirement when the data is other than As
followed by 5s. The trapezoidal region bounds all possible data values
transferred. The lower line represents the scale factor for transferring the same
data. The upper line represents the scale factor for transferring alternating
data (all 0s to all Fs or all As to all 5s, etc.).


The number of possible permutations of data values is quite large. The term
relative data complexity refers to a relative measure of the extent to which
data values are changing and the extent to which bits are changing state.
Relative data complexity therefore ranges from 0, signifying minimal variation
of data, to a normalized value of 1, signifying greatest data variation.
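The manual does not give a formula for relative data complexity. As a rough illustration only, the sketch below uses one plausible proxy: the fraction of bus bits that toggle between successive transfers. The function name and the 32-bit default width are assumptions, and the result only suggests a position between the two extremes; the scale factor itself must still be read from Figure 9-4.

```python
def relative_bit_activity(words, width=32):
    """Fraction of bus bits that toggle between successive transfers.

    Hypothetical proxy for relative data complexity (0.0 to 1.0);
    it is not a definition taken from the manual.
    """
    if len(words) < 2:
        return 0.0
    mask = (1 << width) - 1
    toggles = 0
    for prev, cur in zip(words, words[1:]):
        # XOR marks the bits that differ between two consecutive words
        toggles += bin((prev ^ cur) & mask).count("1")
    return toggles / (width * (len(words) - 1))


# Alternating As and 5s toggles every bit on every transfer
print(relative_bit_activity([0xAAAAAAAA, 0x55555555] * 4))
# Repeating the same word toggles no bits
print(relative_bit_activity([0xFFFFFFFF] * 8))
```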


Figure 9-4. Internal Bus Current Versus Data Complexity Derating Curve


[Graph: normalized IDD (vertical axis) versus operation complexity from 0 to 1
(horizontal axis)]


If a statistical knowledge of the data exists, Figure 9-4 can be used to
determine the power supply requirement more precisely on the basis of internal
bus usage. For example, Figure 9-4 indicates an 89.5% scale factor when all Fs
(FFFF FFFFh) are moved internally every cycle with two accesses per cycle
(80 Mbytes per second). Multiplying this scale factor by 178 mA (from
Figure 9-3) yields 159 mA due to internal bus usage. Therefore, an algorithm
running under these conditions requires about 349 mA of power supply current
(130 + 60 + 159).


Since a statistical knowledge of the data may not be readily available, a nomi- 
nal scale factor may be used. The median between the minimum and maxi- 
mum values at 50% relative data complexity yields a value of 0.93 and can be 
used as an estimate of a nominal scale factor. Therefore, this nominal data 
scale factor of 93% can be used for internal bus data dependency, adding 
165.5 mA to 130 mA (quiescent) and 60 mA (internal operations) to yield 355.5 
mA. As an upper bound, assume worst case conditions of three accesses of 
alternating data every cycle, adding 178 mA to 130 mA (quiescent) and 60 mA 
(internal operations) to yield 368 mA. 
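The arithmetic in this subsection can be checked with a short sketch. It uses the 130 mA, 60 mA, and 178 mA values quoted in the text; the function and parameter names are illustrative assumptions, not part of any TI tool.

```python
QUIESCENT_MA = 130.0         # quiescent current quoted in the text
INTERNAL_OPS_MA = 60.0       # constant adder for internal operations
MAX_INTERNAL_BUS_MA = 178.0  # Figure 9-3 value at three accesses per cycle


def internal_current_ma(bus_ma=MAX_INTERNAL_BUS_MA, data_scale=1.0,
                        internal_ops=True):
    """Quiescent + internal operations + derated internal bus current (mA).

    data_scale is the Figure 9-4 scale factor (0.0 to 1.0).
    internal_ops=False models code running under RPTS, which the text
    says does not add the internal operations component.
    """
    if not 0.0 <= data_scale <= 1.0:
        raise ValueError("data_scale must be between 0 and 1")
    total = QUIESCENT_MA + data_scale * bus_ma
    if internal_ops:
        total += INTERNAL_OPS_MA
    return total


print(round(internal_current_ma(data_scale=0.895), 1))  # all Fs, about 349 mA
print(round(internal_current_ma(data_scale=0.93), 1))   # nominal, about 355.5 mA
print(round(internal_current_ma(data_scale=1.0), 1))    # upper bound, 368 mA
```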
9.4 Current Requirement of Output Driver Components


The output driver circuits on the TMS320C4x are required to drive significantly 
higher DC and capacitive loads than internal device logic drivers. Because of 
this, output drivers impose higher supply current requirements than other sec- 
tions of circuitry in the device. 


Accordingly, the highest values of supply current are exhibited when external 
writes are being performed at high speed. During read cycles, or when the 
external buses are not being used, the TMS320C4x is not driving the data bus; 
this eliminates a significant component of the output buffer current. Further- 
more, in many typical cases, only a few address lines are changing, or the 
whole address bus is static. Under these conditions, an insignificant amount 
of supply current is consumed. Therefore, when no external writes are being 
performed or when writes are performed infrequently, current due to output 
buffer circuitry can be ignored. 


When external writes are being performed, the current required to supply the 
output buffers depends on several considerations: 


- Data pattern being transferred
- Rate at which transfers are being made
- Number of wait states implemented (because wait states affect rates at
  which bus signals switch)
- External bus DC and capacitive loading


External bus operations involve external writes to the device and constitute a
major power-supply current component. The power supply current for the
external buses, made up of four components, is summarized in the following
equation:


Ixbus = (Ibase local + Ilocal) + (Ibase global + Iglobal)

where:


Ibase local/global is the current consumed by the internal driver and pin
capacitance,


Ilocal is the local bus current component, and
Iglobal is the global bus current component.


The remainder of this section describes in detail the calculation of external bus 
current requirements. 


Note:


The DMA current component (IDMA) and communication port current component
(Icp) should be included in the calculation of Ixbus if they are used in the
operations.


9.4.1 Local or Global Bus


The current due to bus writes varies with write cycle time. As discussed in the 
previous section, to obtain accurate current values, you must first determine 
the rate and timing for write cycles to external buses by analyzing program 
activity, including any pipeline conflicts that may exist. To do this, you can use 
information from the TMS320C4x emulator or simulator as well as the 
TMS320C4x User’s Guide. In your analysis, you must account for effects from 
the use of cache, because use of cache can affect whether or not instructions 
are fetched from external memory. 


When evaluating external write activity in a given program segment, you must 
consider whether or not a particular level of external write activity constitutes 
significant activity. If writes are being performed at a slow enough rate, they 
do not impact supply current requirements significantly and can be ignored. 
This is the case, however, only if writes are being performed at very slow rates 
on either the local or global bus. 


When bus-write cycle timing has been established, Figure 9-5 can be used 
to determine the contribution to supply current due to bus activity. Figure 9-5 
shows values of current contribution from the local or global bus for various 
transfer rates. This data was gathered when alternating values of 5555 5555h
and AAAA AAAAh were written at a capacitive load of 80 pF per output signal
line. This condition exhibits the highest current values on the device. The val- 
ues presented in the figure represent the incremental current contributed by 
the local or global bus output driver circuitry under the given conditions. Cur- 
rent values obtained from this graph are later scaled and added to several 
other current terms to calculate the total current for the device. As indicated
in the figure, the lower limit Ibase = Iq + Iiops + Iibus is essentially fo; for
transfer rates less than 1 Mword/second.
Figure 9-5. Local/Global Bus Current Versus Transfer Rate and Wait States


Figure 9-5 demonstrates a feature of the ’C4x’s external bus architecture
known as a posted write. In general, data is written to a latch (or a one-deep
FIFO) and held by the bus until the bus cycle is complete. Since the CPU may
not require that bus again for some time, the CPU is free to perform operations
on other buses until a conflict occurs. Conflicts include DMA, a second write,
or a read to the bus.


In Figure 9-5, the upper line is applicable when STI || STI is not dominated by
execution of internal NOPs and the external wait state is equal to zero. The
lower line shows when STI || STI is internally stalled while waiting for the
external bus to go ready because of wait states. The addition of NOPs between
successive STI || STI operations contributes to internal bus current and
therefore does not result in the lowest possible current.
Figure 9-6. Local/Global Bus Current Versus Transfer Rate at Zero Wait States


[Graph: horizontal axis is transfer rate (Mword/second)]


To further illustrate the relationship of current and write cycle time, Figure 9-6
shows the characteristics of current for various numbers of cycles between
writes for zero wait states. The information on this graph can be used to obtain
more precise values of current whenever zero wait states are used. Table 9-1
lists the number of cycles used for software-generated wait states.


Table 9-1. Wait State Timing Table


Wait State    Read Cycles    Write Cycles
0             1              2
1             2              3
2             3              4
3             4              5


Once a current value has been obtained from Figure 9-5 or Figure 9-6, this
value can be scaled by a data dependency factor if necessary, as described
in section 9.4.4, Data Dependency. This scaled value is then summed along with
several other current terms to determine the total supply current.
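To read a current value from the transfer-rate figures, the write rate must first be known. The sketch below turns the wait-state table into an upper bound on the transfer rate. It assumes the cycle counts are H1 cycles and that accesses are issued back to back; the 20 MHz H1 value and the function name are illustrative assumptions.

```python
# (read cycles, write cycles) per wait-state setting, from the timing table
CYCLES = {0: (1, 2), 1: (2, 3), 2: (3, 4), 3: (4, 5)}


def max_transfer_rate_mwords(h1_mhz, wait_states, write=True):
    """Upper bound on transfer rate in Mword/second.

    Assumes one word per access and back-to-back accesses on one bus,
    with the table's cycle counts interpreted as H1 cycles.
    """
    read_cycles, write_cycles = CYCLES[wait_states]
    cycles = write_cycles if write else read_cycles
    return h1_mhz / cycles


# Hypothetical 20 MHz H1 clock
for ws in sorted(CYCLES):
    print(ws, max_transfer_rate_mwords(20.0, ws))
```

Pipeline conflicts and cache hits usually lower the real rate, so a simulator or emulator trace remains the more reliable source.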
9.4.2 DMA


Using DMA to transfer data consumes power that is data dependent. The current
resulting from DMA bus usage (IDMA) varies linearly with the transfer rate.
Figure 9-7 shows DMA bus current requirements for transferring alternating
data (AAAA AAAAh to 5555 5555h) at several transfer rates; it also shows that
current consumption increases when more DMA channels are used. However,
as more DMA channels are used, the incremental change in current diminishes
as the internal DMA bus becomes saturated. Note that DMA current is
superimposed over the Iibus (internal bus) value.


Figure 9-7. DMA Bus Current Versus Clock Rate
9.4.3 Communication Port


Communication port operations add a data-dependent term to the equation for
the current requirement. The current resulting from communication port
operation (Icp) varies linearly with the transfer rate. Figure 9-8 shows
communication port operation current requirements for transferring alternating
data (AAAA AAAAh to 5555 5555h) at several transfer rates; it also shows that
current consumption increases when more communication port channels are
used. Similar to the DMA bus current consumption, adding communication
ports eventually saturates the peripheral bus as more channels are added.


Figure 9-8. Communication Port Current Versus Clock Rate


Note that since the communication ports are intended to communicate with
other TMS320C4x communication ports over short distances, no additional
capacitive loading was added. In this case, the transmission distance is about
6 inches without additional 80-pF loads. Note that communication port current
is superimposed over the Iibus value.


9.4.4 Data Dependency 


Data dependency of the current for the local and global buses is expressed as 
a scale factor that is a percentage of the maximum current exhibited by either 
of the two buses. 
Figure 9-9. Local/Global Bus Current Versus Data Complexity


Figure 9-9 shows normalized weighting factors that can be used to scale
current requirements on the basis of patterns in data being written on the
external buses. The range of possible weighting factors forms a trapezoidal
pattern bounded by extremes of data values. As the figure shows, the minimum
current occurs when all zeros are written, while the maximum current occurs when
alternating 5555 5555h and AAAA AAAAh are written. This condition results
in a weighting factor of 1, which corresponds to using the values from
Figure 9-5 and/or Figure 9-6 directly.


As with internal bus operations, data dependencies for the external buses are
well defined, but accurate prediction of data patterns is often either impossible
or impractical. Therefore, unless you have precise knowledge of data patterns,
you should use an estimate of a median or average value for the scale factor.
Assuming that data will be neither 5s and As nor all 0s and will be varying
randomly, then a value of 0.80 is appropriate. Otherwise, if you prefer a
conservative approach, you can use a value of 1.0 as an upper bound.


Regardless of the approach taken for scaling, once you determine the scale 
factor for the buses, apply this factor to the current values you determined with 
the graphs in section 9.4.1, Local or Global Bus. 
For example, if a nominal scale factor of 0.80 for the buses is assumed, the
current contribution from the two buses is as follows:


Local or Global: 0.80 x 133 mA = 106.4 mA
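The scaling step can be sketched as follows. The 100 mA figure readings are hypothetical placeholders for values that would be read from the transfer-rate graphs, and the function name is an assumption.

```python
def external_bus_current_ma(local_figure_ma, global_figure_ma, data_scale=0.80):
    """Apply one data-dependency scale factor to both bus readings (mA).

    local_figure_ma and global_figure_ma are values read from the
    transfer-rate graphs; data_scale is the weighting factor, where
    0.80 is the nominal value and 1.0 the conservative upper bound.
    """
    if not 0.0 <= data_scale <= 1.0:
        raise ValueError("data_scale must be between 0 and 1")
    return data_scale * (local_figure_ma + global_figure_ma)


# Hypothetical readings of 100 mA on each bus
print(external_bus_current_ma(100.0, 100.0))        # nominal estimate
print(external_bus_current_ma(100.0, 100.0, 1.0))   # conservative bound
```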


9.4.5 Capacitive Loading Dependence 


Once cycle timing and data dependencies have been accounted for, capaci- 
tive loading effects should be calculated and applied. Figure 9-10 shows the 
current values obtained above as a function of actual load capacitance if the 
load capacitance presented to the buses is less than 80 pF. 


In the previous example, if the load capacitance is 20 pF instead of 80 pF, the 
actual pin current would be 1.66 mA. 


While the slope of the line in Figure 9-10 can be used to interpolate scale fac- 
tors for loads greater than 80 pF, the TMS320C4x is specified to drive output 
loads less than 80 pF; interface timings cannot be guaranteed at higher loads. 
With data dependency and capacitive load scale factors applied to the current 
values for local and global buses, the total supply current required for the 
device for a particular application can be calculated, as described in the next 
section. 


Figure 9-10. Pin Current Versus Output Load Capacitance (10 MHz) 
9.5 Calculation of Total Supply Current


The previous sections have discussed currents contributed by different
sources on the TMS320C4x. Because determinations of actual current values
are unique and independent for each source, each current source was
discussed separately. In an actual application, however, the sum of the
independent contributions determines the total current requirement for the
device. This total current value is exhibited as the total current supplied to the
device through all of the VDD inputs and returned through the VSS connections.


Note that numerous VDD and VSS pins on the device are routed to a variety of
internal connections, not all of which are common. Externally, however, all of
these pins should be connected in parallel to 5 V and ground planes, providing
very low impedance.


As mentioned previously, because of the inherent differences in operations
between program segments, it is usually appropriate to consider current for
each of the segments independently. In this way, peak current requirements
are readily obtained. Further, you can make average current calculations to
use in determining heating effects of power dissipation. These effects, in turn,
can be used to determine thermal management considerations.


9.5.1 Combining Supply Current Due to All Components


To determine the total supply current requirements for any given program 
activity, calculate each of the appropriate components and combine them in 
the following sequence: 


1) Start with the 130 mA quiescent current requirement.


2) Add 60 mA for internal operations unless the device is dormant, such as
when executing IDLE or using an RPTS instruction to perform internal
and/or external bus operations (see Internal Operations in section 9.3).
Internal or external bus operations executed via RPTS do not contribute
an internal operations power supply current component. Therefore,
current components in the next two steps may still be required, even
though the 60 mA is omitted.


3) If significant internal bus operations are being performed (see Internal
Bus Operations in section 9.3), add the calculated current value.


4) If external writes are being performed at high speed (see section 9.4,
Current Requirement of Output Driver Components), then add the values
calculated for local and global bus current components.


5) Add DMA and communication port current requirements if they are used. 




The current value resulting from summing these components is the total 
device current requirement for a given program activity. 
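The five steps above can be sketched as a single function. The 130 mA and 60 mA constants come from the text; the function name, the keyword names, and the `dormant` flag (standing for IDLE or RPTS execution) are illustrative assumptions.

```python
def total_supply_current_ma(*, dormant=False, internal_bus_ma=0.0,
                            local_bus_ma=0.0, global_bus_ma=0.0,
                            dma_ma=0.0, comm_port_ma=0.0):
    """Sum the supply current components for one program segment (mA)."""
    parts = (internal_bus_ma, local_bus_ma, global_bus_ma, dma_ma, comm_port_ma)
    if any(p < 0 for p in parts):
        raise ValueError("current components must be non-negative")
    total = 130.0                 # step 1: quiescent
    if not dormant:
        total += 60.0             # step 2: internal operations
    total += internal_bus_ma      # step 3: internal bus operations
    total += local_bus_ma + global_bus_ma   # step 4: external writes
    total += dma_ma + comm_port_ma          # step 5: DMA and comm ports
    return total


print(total_supply_current_ma(dormant=True))           # IDLE only
print(total_supply_current_ma())                       # internal operations
print(total_supply_current_ma(internal_bus_ma=178.0))  # worst-case internal bus
```

Bus components passed in are expected to be already scaled for data dependency and capacitive load.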


9.5.2 Supply Voltage, Operating Frequency, and Temperature Dependencies 


Three additional factors that affect current requirements are supply voltage
level, operating temperature, and operating frequency. However, these
considerations affect total supply current, not specific components (that is,
internal or external bus operations). Note that supply voltages, operating
temperature, and operating frequency must be maintained within required device
specifications.


The scale factor for these dependencies is applied in the same manner as
discussed in previous sections, once the total current for a particular program
segment has been determined. Figure 9-11 shows the relative scale factors
to be applied to the supply current values as a function of both VDD and
operating frequency.


Figure 9-11. Current Versus Frequency and Supply Voltage 
Power-supply current consumption does not vary significantly with operating
temperature. However, you can use a scale factor of 2% normalized IDD per
50°C change in operating temperature to derate current within the specified
range noted in the TMS320C4x data sheet.


Figure 9-12. Change in Operating Temperature (°C)


This temperature dependence is shown graphically in Figure 9-12. Note that
a temperature scale factor of 1.0 corresponds to current values at 25°C, which
is the temperature at which all other references in the document are made.


9.5.3 Design Equation 
The procedure for determining the power-supply current requirement can be
summarized in the following equation:


Itotal = (Iqidle + Iiops + Iibus + Ixbusglobal + Ixbuslocal + IDMA + Icp) x F x V x T

where:

F is a scale factor for frequency

V is a scale factor for supply voltage

T is a scale factor for operating temperature


Table 9-2 describes the symbols used in the power-supply current equation
and gives the value and the figure from which the value is obtained.
Table 9-2. Current Equation Typical Values (FCLK = 40 MHz)


                             Value
Symbol             Min      Typical  Max      Note                                    Reference
Iqidle2            -        20 µA    50 µA    Idle2 shutdown                          Figure 9-2
Iqidle             130 mA   130 mA   130 mA   Internal idle                           Figure 9-2
Iiops              60 mA    60 mA    60 mA    Branch to self internal                 Figure 9-2
Iibus              0 mA     50 mA    190 mA   Data dependent                          Figure 9-3, Figure 9-4
Ixbusglobal (Max)  0 mA     50 mA    280 mA   Data and Cload dependent                Figure 9-5, Figure 9-6, Figure 9-9
Ixbuslocal (Max)   0 mA     50 mA    280 mA   Data and Cload dependent                Figure 9-5, Figure 9-6, Figure 9-9
IDMA               0 mA     50 mA    300 mA   Data and source/destination dependent   Figure 9-7
Icp                0 mA     50 mA    250 mA   Data dependent                          Figure 9-8

Notes: 1) All values are scaled by frequency and supply voltage. The nominal tested frequency is 40 MHz.
       2) Externally-driven signals are capacitive-load dependent.
       3) It is unrealistic to add all of the maximum values, since it is impossible to run at those levels.
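A minimal sketch of the design equation follows. The field names mirror the symbols in the equation, the defaults for the two constant terms come from the text, and the scale factors F, V, and T are left as inputs because they must be read from the frequency, voltage, and temperature figures. The class and method names are assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass
class CurrentTerms:
    """Additive supply current terms of the design equation, in mA."""
    iqidle: float = 130.0
    iiops: float = 60.0
    iibus: float = 0.0
    ixbusglobal: float = 0.0
    ixbuslocal: float = 0.0
    idma: float = 0.0
    icp: float = 0.0

    def total(self, f=1.0, v=1.0, t=1.0):
        """Sum all terms, then apply frequency, voltage, and temperature factors."""
        return sum(asdict(self).values()) * f * v * t


# Typical column of the table, with all scale factors at 1.0
typical = CurrentTerms(iibus=50.0, ixbusglobal=50.0, ixbuslocal=50.0,
                       idma=50.0, icp=50.0)
print(typical.total())
```

Summing the Max column the same way would overstate the requirement, as the table notes point out.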


9.5.4 Average Current 


Over the course of an entire program, some segments typically exhibit
significantly different levels of current for different durations. For example,
a program may spend 80% of its time performing internal operations and draw a
current of 250 mA; it may spend the remaining 20% of its time performing writes
at full speed to both buses and drawing 790 mA.


While knowledge of peak current levels is important in order to establish power
supply requirements, some applications require information about average
current. This is particularly significant if periods of high peak current are
short in duration. You can obtain average current by performing a weighted sum
of the current due to the various independent program segments over time. You
can calculate the average current for the example in the previous paragraph as
follows:


I = 0.8 x 250 mA + 0.2 x 790 mA = 358 mA


Using this approach, you can calculate average current for any number of
program segments.
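The weighted sum generalizes to any number of segments. In this sketch each segment is a (time fraction, current in mA) pair; the function name and the normalization by the sum of the weights are assumptions added for convenience.

```python
def average_current_ma(segments):
    """Time-weighted average of (weight, current_ma) pairs.

    Weights may be fractions or durations; they are normalized by their sum.
    """
    total_weight = sum(weight for weight, _ in segments)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(weight * current for weight, current in segments) / total_weight


# The two-segment example from the text: about 358 mA
print(average_current_ma([(0.8, 250.0), (0.2, 790.0)]))
```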

9.5.5 Thermal Management Considerations 


Heating characteristics of the TMS320C4x are dependent upon power
dissipation, which, in turn, is dependent upon power supply current. When
making thermal management calculations, you must consider the manner in which
power supply current contributes to power dissipation and to the TMS320C4x
package thermal characteristics’ time constant.


Depending on the sources and destinations of current on the device, some
current contributions to IDD do not constitute a component of power dissipation
at 5 volts. That is to say, the TMS320C4x may be acting only as a switch, in
which case, the voltage drop is across a load and not across the ’C4x. If the
total current flowing into VDD is used to calculate power dissipation at 5 volts,
erroneously large values for package power dissipation will be obtained. The
error occurs because the current resulting from driving a logic high level into
a DC load appears only as a portion of the current used to calculate system
power dissipation due to VDD at 5 volts. Power dissipation is defined as:


P = V x I


where P is power, V is voltage, and I is current. If device outputs are driving
any DC load to a logic high level, only a minor contribution is made to power
dissipation because CMOS outputs typically drive to a level within a few tenths
of a volt of the power supply rails. If this is the case, subtract these current
components out of the TMS320C4x supply current value and calculate their
contribution to system power dissipation separately (see Figure 9-13).
Figure 9-13. Load Currents


[Diagram: two TMS320C4x output stages. One panel shows IDD = IOUT; the panel
labeled Device Output Driven Low shows ISS = IOUT.]


Furthermore, external loads draw supply current (IDD) only when outputs are
driven high, because when outputs are in the logic zero state, the device is
sinking current through VSS, which is supplied from an external source.
Therefore, the power dissipation due to this component will not contribute
through IDD but will contribute to power dissipation with a magnitude of:


P = VOL x IOL


where VOL is the low-level output voltage and IOL is the current being sunk by
the output, as shown in Figure 9-13. The power dissipation component due
to outputs being driven low should be calculated and added to the total power
dissipation.


When outputs with DC loads are being switched, the power dissipation compo- 
nents from outputs being driven high and outputs being driven low should be 
averaged and added to the total device power dissipation. Power components 
due to DC loading of the outputs should be calculated separately for each pro- 
gram segment before average power is calculated. 
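A short sketch of the DC load power terms follows. The low-side term is the P = VOL x IOL expression from the text. The high-side term, (VDD - VOH) x IOH, is an assumption based on the voltage drop across the output driver, and all numeric values are hypothetical.

```python
def low_output_power_w(vol_v, iol_a):
    """Power dissipated in the device by one output sinking current (W)."""
    return vol_v * iol_a


def high_output_power_w(vdd_v, voh_v, ioh_a):
    """Power dissipated in the device by one output sourcing current (W).

    Assumes only the drop between the supply rail and the pin is
    dissipated inside the device.
    """
    return (vdd_v - voh_v) * ioh_a


def switched_output_power_w(p_high_w, p_low_w, fraction_high=0.5):
    """Average of the high and low components for a switching output."""
    return fraction_high * p_high_w + (1.0 - fraction_high) * p_low_w


# Hypothetical load: VOL = 0.4 V at 2 mA, VOH = 4.8 V at 0.3 mA, VDD = 5 V
p_low = low_output_power_w(0.4, 0.002)
p_high = high_output_power_w(5.0, 4.8, 0.0003)
print(switched_output_power_w(p_high, p_low))
```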


Note that unused inputs that are left unconnected may float to a voltage level 
that will cause the input buffer circuits to remain in the linear region, and there- 
fore contribute a significant component to power supply current. Accordingly, 
if you want absolute minimum power dissipation, you should make any unused 
inputs inactive by either grounding or pulling them high. If several unused 
inputs must be pulled high, they can be pulled high together through one resis- 
tor to minimize component count and board space. 
When you use power dissipation values to determine thermal management
considerations, use the average power unless the time duration of individual
program segments is long. The thermal characteristics of the TMS320C40 in
the 325-pin PGA package are exponential in nature with a time constant on
the order of minutes. Therefore, when subjected to a change in power, the
temperature of the device package will require several minutes or more to reach
thermal equilibrium.


If the duration of program segments exhibiting high power dissipation values 
is short (on the order of a few seconds) in comparison to the package thermal 
characteristics’ time constant, use average power calculated in the same man- 
ner as average current described in the previous section. Otherwise, calculate 
maximum device temperature on the basis of the actual time required for the 
program segments involved. For example, if a particular program segment 
lasts for 7 minutes, the device essentially reaches thermal equilibrium due to 
the total power dissipation during the period of device activity. 


Note that the average power should be determined by calculating the power 
for each program segment (including all considerations described above) and 
performing a time average of these values, rather than simply multiplying the 
average current by Vpp, as determined in the previous subsection. 


Calculate specific device temperature by using the TMS320C4x thermal 
impedance characteristics included in the TMS320C4x data sheet. 
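The exponential behavior described in this section can be illustrated with a first-order model. This is only a sketch: the 120-second time constant and the 10 °C/W thermal impedance are hypothetical placeholders, and real values must come from the TMS320C4x data sheet.

```python
import math


def steady_state_temp_c(ambient_c, power_w, theta_c_per_w):
    """Equilibrium temperature for a constant power dissipation."""
    return ambient_c + power_w * theta_c_per_w


def package_temp_c(start_c, final_c, elapsed_s, tau_s):
    """First-order exponential approach from start_c toward final_c."""
    return final_c + (start_c - final_c) * math.exp(-elapsed_s / tau_s)


TAU_S = 120.0   # hypothetical time constant
THETA = 10.0    # hypothetical thermal impedance, degrees C per watt

final = steady_state_temp_c(25.0, 2.0, THETA)
# A burst lasting a few seconds barely moves the package temperature
print(round(package_temp_c(25.0, final, 3.0, TAU_S), 2))
# A 7-minute segment comes close to equilibrium
print(round(package_temp_c(25.0, final, 420.0, TAU_S), 2))
```

With these placeholder values, the short burst supports using average power, while the long segment has to be evaluated at its own power level.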


9.6 Example Supply Current Calculations


An FFT represents a typical DSP algorithm. The FFT code used in this
calculation processes data in the RAM blocks. The entire algorithm consists
mainly of internal bus operations and hence includes quiescent and, in general,
internal operations. At the end of the processing, the results are written out
on the global and local bus. Therefore, the algorithm exhibits a higher current
requirement during the write portion where the external bus is being used
significantly.


9.6.1 Processing


The processing portion of the algorithm is 95% of the total algorithm. During
this portion, the power-supply current is required for the internal circuitry only.
Data is processed in several loops that make up the majority of the algorithm.
During these loops, two operands are transferred on every cycle. The current
required for internal bus usage, then, is 60 mA (from Figure 9-3). The data is
assumed to be random. A data value scale factor of 0.93 is used (from
Figure 9-4). This value scales 60 mA, yielding 55.8 mA for internal bus
operations. Adding 55.8 mA to the quiescent current requirement and internal
operations current requirement yields a current requirement of 245.8 mA for the
major portion of the algorithm.


I = Iq + Iiops + Iibus
I = 130 mA + 60 mA + (60 mA) (0.93)
  = 245.8 mA


9.6.2 Data Output


The portion of the algorithm corresponding to writing out data is approximately
5% of the total algorithm. Again, the data that is being written is assumed to
be random. From Figure 9-4 and Figure 9-9, scale factors of 0.93 and 0.8
are used for derating due to data value dependency for internal and local
buses, respectively. During the data dump portion of the code, a load and a
store are performed every cycle; however, the parallel load/store instruction
is in an RPTS loop. Therefore, there is no contribution due to internal
operations, because the instruction is fetched only once. The only internal
contributions are due to quiescent and internal bus operations. Figure 9-5
indicates a 23-mA current contribution due to writes every available cycle.
Therefore, the total contribution due to this portion of the code is:


I = Iq + Iibus + Ixbus

or

I = 130 mA + (60 mA) (0.93) + 85 mA + (23 mA) (0.8)
  = 289.2 mA
9.6.3 Average Current


The average current is derived from the two portions of the algorithm. The
processing portion took 95% of the time and required about 245.8 mA; the data
dump portion took the other 5% and required about 289.2 mA. The average
is calculated as:


Iavg = (0.95) (245.8 mA) + (0.05) (289.2 mA)
     = 247.97 mA


From the thermal characteristics specified in the TMS320C4x User’s Guide, 
it can be shown that this current level corresponds to a case temperature of 
28°C. This temperature meets the maximum device specification of 85°C and 
hence requires no forced air cooling. 
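The FFT figures in this section can be reproduced with a few lines. All numbers are taken from the worked equations; the 85 mA term is used as it appears there, without further interpretation, and the function names are illustrative assumptions.

```python
IQ_MA = 130.0            # quiescent
IIOPS_MA = 60.0          # internal operations
IBUS_FIGURE_MA = 60.0    # internal bus value at two operands per cycle
IBUS_SCALE = 0.93        # internal data scale factor
XBUS_FIXED_MA = 85.0     # fixed term from the data output equation
XBUS_FIGURE_MA = 23.0    # external write contribution
XBUS_SCALE = 0.8         # external data scale factor


def processing_ma():
    """Current during the processing loops (internal circuitry only)."""
    return IQ_MA + IIOPS_MA + IBUS_FIGURE_MA * IBUS_SCALE


def data_output_ma():
    """Current during the RPTS data dump (no internal operations term)."""
    return (IQ_MA + IBUS_FIGURE_MA * IBUS_SCALE
            + XBUS_FIXED_MA + XBUS_FIGURE_MA * XBUS_SCALE)


def fft_average_ma():
    """Time-weighted average: 95% processing, 5% data output."""
    return 0.95 * processing_ma() + 0.05 * data_output_ma()


print(round(processing_ma(), 2))   # about 245.8 mA
print(round(data_output_ma(), 2))  # about 289.2 mA
print(round(fft_average_ma(), 2))  # about 247.97 mA
```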


9.6.4 Experimental Results 
A photograph of the power-supply current for the FFT, using a 40-MHz system
clock, is shown in Appendix A. During the FFT processing, the current varied 
between 190 and 220 mA. The current during external writes had a peak of 230 
mA, and the average current requirement as measured on a digital multimeter 
was 205 mA. Scaling those results to the 50-MHz calculations yielded results 
that were close to the actual measured power-supply current. 




9.7 Design Considerations 


Designing systems for minimum power dissipation involves reducing device 
operating current requirements due to signal switching rate, capacitive load- 
ing, and other effects. Selective consideration of these effects makes it pos- 
sible to optimize system performance while minimizing power consumption. 
This section describes current reduction techniques based on operating cur- 
rent dependencies of the device as discussed in previous sections of this doc- 
ument. 


9.7.1 System Clock and Signal Switching Rates 


Since current (and therefore, power) requirements of CMOS devices are 
directly proportional to switching frequency, one potential approach to mini- 
mizing operating power is to minimize system clock frequency and signal 
switching rates. Although performance is often directly proportional to system 
clock and signal switching rates, tradeoffs can be made in both areas to 
achieve an optimal balance between power usage and performance in the 
design of a system. 


If reducing power is a primary goal, and a given system design does not have 
particularly demanding performance requirements, the system clock rate can 
be reduced with the corresponding savings in power. Minimum power is real- 
ized when system clock rates are only as fast as necessary to achieve required 
system performance. Additionally, if overall system clock rates cannot be 
reduced, an alternative approach to power reduction is to reduce clock speed 
wherever possible during periods of inactivity. 


Also, an appropriate choice of clock generation approach helps minimize
system power dissipation. The use of an external oscillator rather than
the on-chip oscillator can result in lower device and system power
dissipation levels. As described previously, the internal oscillator can
require as much as 10 mA when operating at 40 MHz. If you use an external
oscillator that requires less than 10 mA for clock generation, overall system
power is reduced.


When considering switching rates of signals other than the system clock, the 
main consideration is to minimize switching. Specifically, any unnecessary 
switching should be avoided. Outputs or inputs that are unused should either 
be disabled, tied high, or grounded, whichever is appropriate. Additionally, out- 
puts connected to external circuitry should drive other power dissipation ele- 
ments only when absolutely necessary. 


9.7.2 Capacitive Loading of Signals 


Current requirements are also directly proportional to capacitive loading. 
Therefore, all capacitive loading should be minimized. This is especially signif- 
icant for device outputs. 


The approaches to minimize capacitive loading are consistent with efficient PC 
board layout and construction practices. Specifically, signal runs should be as 
short as possible, especially for signals with high switching rates. Also, signals 
should not run long distances across PC boards to edge connectors unless 
absolutely necessary. 


Note that the buffering of device outputs that must drive high capacitive loads 
reduces supply current for the TMS320C40, but this current is translated to the 
buffering device. Whether or not this is a valid tradeoff must be determined at 
the system level. The two main considerations are: 1) whether the power 
required by the buffers is more or less than the power required from the ’C40 
to drive the load in question, and 2) whether or not off-loading the power to the 
buffers has any implications with respect to system power-down modes. It may 
be desirable to use buffers to drive high capacitive loads, even though they 
may require more current than the TMS320C40, especially in cases where 
part of the system may be powered down but the TMS320C40 is still required 
to interface to other low capacitance loads. 


9.7.3 DC Component of Signal Loading 


In order to achieve lowest device current requirements, the internal and
external DC load component of device input and output signal loading must
also be minimized.


Any device inputs that are unused and left floating may cause excessively high
DC current to be drawn by their input buffer circuitry. This occurs because,
if an input is left unconnected, the voltage on the input may float to a level
that causes the input buffer to be biased at a point within its range of linear
operation. This can cause the input buffer circuit to draw a significant DC
current directly from VDD to ground. Therefore, any unused device inputs
should be pulled up to VDD via a pullup resistor of nominally 20 kΩ, or driven
high with an unused gate. Input-only pins that are not used can be pulled up in
parallel with other inputs of the same type with a single gate or resistor to
minimize system component count. In this case, up to 15 or more standard
device inputs can be pulled up with a single resistor.


Any device I/O pins that are unused should be selected as outputs. This avoids 
the requirement for pull-ups (to ensure that the I/O input stage is not biased 
in the linear region) and therefore eliminates an unnecessary current compo- 
nent. 


For any device output, any DC load present is directly reflected in the system’s 
power-supply current. Therefore, DC loading of outputs should be reduced to 
a minimum. If DC currents are being sourced from the address bus outputs, 
the address bus should be set to a level that minimizes the current through the 
external load. This can be accomplished by performing a dummy read from an 
external address. 


For I/O pins that must be used in both the input and output modes, individual
pullup resistors of nominally 20 kΩ should be used to ensure minimum power
dissipation if these pins are not always driven to a valid logic state. This is
particularly true of the data-bus pins. When the bus is not being driven
explicitly, it is left floating, which can cause excessively high currents to be
drawn on the input buffer section of all 64 bits of the bus. In this case,
because all 64 data bus bits are normally used independently in most
applications, each data-bus pin should be pulled up with a separate resistor
for minimum power.
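

Pullup resistors are not free: each one conducts whenever its pin is driven low. The sketch below estimates that cost with Ohm's law. It assumes a 5-V supply (the voltage listed for the devices in Table 10-3) and an ideal 0-V low level; the function name and the example pin counts are illustrative only.

```python
def pullup_current_ma(vdd_volts, r_pullup_ohms, pins_driven_low):
    """Current through pullup resistors whose pins are held at 0 V."""
    per_pin_amps = vdd_volts / r_pullup_ohms
    return pins_driven_low * per_pin_amps * 1e3


# One 20-kOhm pullup on a pin driven low, assuming VDD = 5 V.
print(f"{pullup_current_ma(5.0, 20e3, 1):.2f} mA per pin")

# Eight such pins driven low at the same time.
print(f"{pullup_current_ma(5.0, 20e3, 8):.2f} mA for 8 pins")
```

A quarter of a milliamp per low pin is small compared with the current a floating input buffer can draw in its linear region, which is why the nominal 20-kΩ value is a reasonable compromise.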


Chapter 10 


Development Support and Part Order Information 


This chapter provides development support information, socket descriptions, 
device part numbers, and support tool ordering information for the ’C4x. 


Each ’C4x support product is described in the TMS320 Family Development
Support Reference Guide (literature number SPRU011). In addition, more
than 100 third-party developers offer products that support the TI TMS320
family. For more information, refer to the TMS320 Third-Party Reference
Guide (literature number SPRU052).


For information on pricing and availability, contact the nearest TI Field Sales 
Office or authorized distributor. See the list at the back of this book. 


Texas Instruments offers an extensive line of development tools for the 
TMS320C4x generation of DSPs, including tools to evaluate the performance 
of the processors, generate code, develop algorithm implementations, and ful- 
ly integrate and debug software and hardware modules. 


The following products support the development of ’C4x applications: 


Code Generation Tools 


❏  The optimizing ANSI C compiler translates ANSI C language directly into
   highly optimized assembly code. You can then assemble and link this code
   with the TI assembler/linker, which is shipped with the compiler. It
   supports both ’C3x and ’C4x assembly code. This product is currently
   available for PCs (DOS, DOS extended memory, OS/2), VAX/VMS, and SPARC
   workstations. See the TMS320 Floating-Point DSP Optimizing C Compiler
   User’s Guide (SPRU034) for detailed information about this tool.


❏  The assembler/linker converts source mnemonics to executable object
   code. It supports both ’C3x and ’C4x assembly code. This product is
   currently available for PCs (DOS, DOS extended memory, OS/2). The
   ’C3x/’C4x assembler for the VAX/VMS and SPARC workstations is only
   available as part of the optimizing ’C3x/’C4x compiler. See the TMS320
   Floating-Point DSP Assembly Language Tools User’s Guide (SPRU035)
   for detailed information about available assembly-language tools.


❏  The digital filter design package helps you design digital filters.


System Integration and Debug Tools 


❏  The simulator simulates (via software) the operation of the ’C4x and can
   be used in C and assembly software development. This product is
   currently available for PCs (DOS, Windows) and SPARC workstations. See the
   TMS320C4x C Source Debugger User’s Guide (SPRU054) for detailed
   information about the debugger.


❏  The XDS510 emulator performs full-speed in-circuit emulation with the
   ’C4x, providing access to all registers as well as to internal and external
   memory of the device. It can be used in C and assembly software
   development and has the capability to debug multiple processors. This
   product is currently available for PCs (DOS, Windows, OS/2) and SPARC
   workstations. This product includes the emulator board (emulator box,
   power supply, and SCSI connector cables in the SPARC version), the ’C4x
   C Source Debugger, and the JTAG cable.


   Because ’C3x and ’C5x XDS510 emulators also come with the same
   emulator board (or box) as the ’C4x, you can buy the ’C4x C Source Debugger
   Software as a separate product called ’C4x C Source Debugger Conversion
   Software. This enables you to debug ’C3x/’C4x applications with the
   same emulator board. The emulator cable that comes with the ’C3x
   XDS510 emulator cannot be used with the ’C4x. A JTAG emulation
   conversion cable (see Section 10.3) is needed instead. The emulator cable
   that comes with the ’C5x XDS510 emulator can also be used for the ’C4x
   without any restriction. See the TMS320C4x C Source Debugger User’s
   Guide (SPRU054) for detailed information about the ’C4x emulator.


❏  The parallel processing development system (PPDS) is a stand-alone
   board with four ’C4xs directly connected to each other via their
   communication ports. Each ’C4x has 64K-words SRAM and 8K-byte EPROM as
   local memory, and they all share a 128K-word global SRAM. See the
   TMS320C4x Parallel Processing Development System Technical
   Reference (SPRU075) for detailed information about the PPDS.


❏  The emulation porting kit (EPK) enables you to integrate emulation
   technology directly into your system without the need of an XDS510
   board. This product is intended to be used by third parties and
   high-volume board manufacturers and requires a licensing agreement with
   Texas Instruments.


10.1.1 Third-Party Support 


The TMS320 family is supported by products and services from more than 100 
independent third-party vendors and consultants. These support products 
take various forms (both as software and hardware), from cross-assemblers, 
simulators, and DSP utility packages to logic analyzers and emulators. The ex- 
pertise of those involved in support services ranges from speech encoding and 
vector quantization to software/hardware design and system analysis. 


See the TMS320 Third-Party Support Reference Guide (literature number 
SPRU052) for a more detailed description of services and products offered by 
third parties. 


10.1.2 The DSP Hotline 


For answers to TMS320 technical questions on device problems, develop- 
ment tools, documentation, upgrades, and new products, you can contact the 
DSP hotline via: 


❏  Phone: (713) 274-2320, Monday through Friday from 8:30 a.m. to 5:00
   p.m. central time


❏  Fax: (713) 274-2324 (US DSP hotline), +33-1-3070-1032 (European
   DSP hotline)


❏  Electronic mail: 4389750@mcimail.com


To ask about third-party applications and algorithm development packages,
contact the third party directly. Refer to the TMS320 Third-Party Support
Reference Guide (SPRU052) for addresses and phone numbers.


Extensive DSP documentation is available; this includes data sheets, user’s 
guides, and application reports. Contact the hotline for information on litera- 
ture that you can request from the Literature Response Center, 
(800)477-8924. 


The DSP hotline does not provide pricing information. Contact the nearest 
TI Field Sales Office for prices and availability of TMS320 devices and support 
tools. 


10.1.3 The Bulletin Board Service (BBS) 


The TMS320 DSP Bulletin Board Service (BBS) is a telephone-line computer 
service that provides information on TMS320 devices, specification updates 
for current or new devices and development tools, silicon and development 
tool revisions and enhancements, new DSP application software as it be- 
comes available, and source code for programs from any TMS320 user’s 
guide. 


You can access the BBS via: 


❏  Modem: (300-, 1200-, or 2400-bps) dial (713) 274-2323. Set your modem
   to 8 data bits, 1 stop bit, no parity.


To find out more about the BBS, refer to the TMS320 Family Development
Support Reference Guide (literature number SPRU011).


10.1.4 Internet Services 


Texas Instruments offers two Internet-accessible services for DSP support: an
ftp site and a www site.


❏  World-wide web: Point your browser at http://www.ti.com to access TI’s
   web site. At the site, you can follow links to find product information, online
   literature, an online lab, and the 320 Hotline online.


❏  FTP: Use anonymous ftp to ti.com (Internet port address 192.94.94.1) to
   access copies of the files found on the BBS. The BBS files are located in
   the subdirectory called mirrors.


10.1.5 Technical Training Organization (TTO) TMS320 Workshops 


’C4x DSP Design Workshop. This workshop is tailored for hardware and
software design engineers and decision-makers who will be designing and
utilizing the ’C4x generation of DSP devices. Hands-on exercises throughout the
course give participants a rapid start in developing ’C4x design skills.
Microprocessor/assembly language experience is required. Experience with
digital design techniques and C language programming is desirable.


These topics are covered in the ’C4x workshop: 


❏  ’C4x architecture/instruction set
❏  Use of the PC-based software simulator
❏  Use of the ’C3x/’C4x assembler/linker
❏  C programming environment
❏  System architecture considerations
❏  Memory and I/O interfacing
❏  Development support


For registration information, pricing, or to enroll, call (800) 336-5236, ext.
3904.


10.2 Sockets 


Table 10-1 contains available sockets that accept the 325-pin ’C40 pin grid
array (PGA) and the 304-pin ’C44 plastic quad flatpack (PQF). Table 10-2
lists the phone numbers of the manufacturers listed in Table 10-1.


Table 10-1. Sockets that Accept the 325-pin ’C40 and the 304-pin ’C44


Manufacturer                Type                                 Part Number

Advanced Interconnections   C40-wire-wrap socket                 3919
AMP                         C40-tool-activated ZIF socket        AMP 382533-9
AMP                         Actuation tool for AMP 382533-9      AMP 854234-1
AMP                         C40-handle-activated ZIF socket      AMP 382320-9
AMP                         C40-PGA ZIF                          AMP 55291-2
Emulation Technology        C40-logic analyzer socket            BZ6-325-H6A35-TMS320C40Z
Emulation Technology        C40-wire-wrap socket                 AB-325-H6A35Z-P13-M
Mark Eyelet                 C40-wire-wrap socket                 MP325-73311D16
Yamaichi                    TMS320C44 PDB socket (304 pins)      IC201-3044-004


Table 10-2. Manufacturer Phone Numbers


Manufacturer                Phone Number

AMP                         (717) 564-0100
Advanced Interconnections   (401) 823-5200
Emulation Technology        (408) 982-0660
Mark Eyelet                 (203) 756-8847
Yamaichi                    (408) 456-0797


The remainder of this section describes two available sockets that accept the 
’C4x pin grid array (PGA). Both sockets feature zero insertion force (ZIF): 


❏  A tool-activated ZIF socket (TAZ)
❏  A handle-activated ZIF socket (HAZ)


The sockets described herein are manufactured by AMP Incorporated. 


10.2.1 Tool-Activated ZIF PGA Socket (TAZ) 


Figure 10-1. Tool-Activated ZIF Socket


[Figure: socket outline drawing; dimensions shown are 0.350 in. max. and 2.061 in. max.]


Description:

AMP part number:      382533-9
Pin positions:        325
Solder tail length:   0.170 in. for PC boards 0.125 in.
                      thick (other tail lengths available)
Actuator tool:        354234-1

Features:

❏  Slightly larger than a PGA device
❏  Easy package loading because of large funnel entry
❏  Zero insertion force
❏  Contact wiping action during insertion ensures clean contact points
❏  Spring-loaded cover ensures proper loading
❏  Can be used with robotic insertion and removal
❏  Horizontal vs. vertical socket forces prevent damage to the device


10.2.2 Handle-Activated ZIF PGA Socket (HAZ)


Figure 10-2. Handle-Activated ZIF Socket


[Figure: socket outline drawing; dimensions shown are 2.700 in. max., 0.350 in. max., and 0.650]


Description:

AMP part number:      382320-9
Pin positions:        325
Solder tail length:   0.170 in. for PC boards 0.125 in.
                      thick (other tail lengths available)

Features:

❏  Can be used for test and burn-in
❏  Spring contacts are normally closed
❏  Easy package loading because of large funnel entry
❏  Zero insertion force
❏  Contact wiping action during socket closing ensures clean contact points
❏  Maximum operating temperature is 160°C (to allow burn-in capability)


10.3 Part Order Information 


This section describes the part numbers of ’C4x devices, development support 
hardware, and software tools. 


10.3.1 Nomenclature 


To designate the stages in the product development cycle, Texas Instruments 
assigns prefixes to the part numbers of all TMS320 devices and support tools. 
Each TMS320 device has one of three prefixes: TMX, TMP, or TMS. Each sup- 
port tool has one of two possible prefix designators: TMDX or TMDS. These 
prefixes represent evolutionary stages of product development from engineer- 
ing prototypes (TMX/TMDX) through fully qualified production devices and 
tools (TMS/TMDS). This development flow is defined below. 


Device Development Evolutionary Flow:


TMX    The part is an experimental device that is not necessarily
       representative of the final device’s electrical specifications.

TMP    The part is a device from a final silicon die that conforms to the
       device’s electrical specifications but has not completed quality and
       reliability verification.

TMS    The part is a fully qualified production device.


Support Tool Development Evolutionary Flow:


TMDX   The development-support product has not yet completed Texas
       Instruments internal qualification testing.

TMDS   The development-support product is a fully qualified development
       support product.


TMX and TMP devices and TMDX development support tools are shipped with 
the following disclaimer: 


“Developmental product is intended for internal evaluation purposes.” 


TMS devices and TMDS development support tools have been fully character- 
ized, and the quality and reliability of the device has been fully demonstrated. 
Texas Instruments standard warranty applies to these products. 


Note:

It is expected that prototype devices (TMX or TMP) have a greater failure rate
than standard production devices. Texas Instruments recommends that
these devices not be used in any production system, because their expected
end-use failure rate is still undefined. Only qualified production devices
should be used.


TI device nomenclature also includes the device family name and a suffix. This 
suffix indicates the package type (for example, N, FN, or GB) and temperature 
range (for example, L). Figure 10-3 provides a legend for reading the com- 
plete device name for any TMS320 family member. 


Figure 10-3. Device Nomenclature


Example:  TMS  320  C  40  GF  L

The fields, from left to right, are:

PREFIX
   TMX = experimental device
   TMP = prototype device
   TMS = qualified device
   SMJ = ceramic QML
   SMQ = plastic QML

DEVICE FAMILY
   320 = TMS320 family

TECHNOLOGY
   C = CMOS
   E = CMOS EPROM

DEVICE
   (40 in the example)

PACKAGE TYPE
   FD  = ceramic leadless CC
   FN  = plastic leaded CC
   FZ  = ceramic CER-QUAD
   GB  = 181-pin ceramic PGA
   GE  = 181-pin ceramic PGA
   GF  = 325-pin ceramic PGA
   HFH = 352-leaded CER-QFP
   J   = ceramic DIP
   JD  = ceramic DIP, side-brazed
   N   = plastic DIP
   TA  = tape automated bonding (encapsulated)
   TB  = tape automated bonding (bare die)
   KGD = known good die
   PDB = 304-pin plastic quad flatpack

TEMPERATURE RANGE (AMBIENT)
   L = 0 to 70°C
   M = -55 to 125°C
   S = -55 to 100°C

10.3.2 Device and Development Support Tools 


Table 10-3 lists 'C4x device part numbers. Table 10-4 lists the development 
support tools available for the 'C4x DSP, their part numbers, and the platform 
on which they run. 


Table 10-3. Device Part Numbers


Device Part Number   Voltage   Operating Frequency   Package

TMS320C40GFL         5 V       50 MHz/40 ns          325-pin ceramic PGA
TMS320C40GFL60       5 V       60 MHz/33 ns          325-pin ceramic PGA
TMS320C44PDB50       5 V       50 MHz/40 ns          304-pin PQFP
TMS320C44PDB60       5 V       60 MHz/33 ns          304-pin PQFP
SMJ320C40GFM40       5 V       40 MHz/50 ns          325-pin ceramic PGA
SMJ320C40GFM50       5 V       50 MHz/40 ns          325-pin ceramic PGA
SMJ320C40HFHM40      5 V       40 MHz/50 ns          352-lead ceramic PGA
SMJ320C40HFHM50      5 V       50 MHz/40 ns          352-lead ceramic PGA
SMJ320C40TAM40       5 V       40 MHz/50 ns          324 pad TAB tape (encapsulated)
SMJ320C40TBM40       5 V       40 MHz/50 ns          324 pad TAB tape (bare die)
TMS320C40TAL50       5 V       50 MHz/40 ns          324 pad TAB tape (encapsulated)
SMJ320C40TAM50       5 V       50 MHz/40 ns          324 pad TAB tape (encapsulated)
SMJ320C40TBM50       5 V       50 MHz/40 ns          324 pad TAB tape (bare die)
TMS320C40TAL60       5 V       60 MHz/33 ns          324 pad TAB tape (encapsulated)
SMJ320C40KGDM40      5 V       40 MHz/50 ns          Known Good Die
SMJ320C40KGDM50      5 V       50 MHz/40 ns          Known Good Die
TMS320C40KGDL50      5 V       50 MHz/40 ns          Known Good Die
TMS320C40KGDL60      5 V       60 MHz/33 ns          Known Good Die
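

The part numbers in Table 10-3 follow the nomenclature of Figure 10-3, so they can be split into fields mechanically. The sketch below is illustrative only: the regular expression, the function name, and the treatment of the trailing two digits as a speed grade in MHz are assumptions inferred from the table, not a TI-published grammar.

```python
import re

# Codes transcribed from Figure 10-3.
PREFIXES = ("TMX", "TMP", "TMS", "SMJ", "SMQ")
PACKAGES = {
    "FD": "ceramic leadless CC", "FN": "plastic leaded CC",
    "FZ": "ceramic CER-QUAD", "GB": "181-pin ceramic PGA",
    "GE": "181-pin ceramic PGA", "GF": "325-pin ceramic PGA",
    "HFH": "352-leaded CER-QFP", "J": "ceramic DIP",
    "JD": "ceramic DIP, side-brazed", "N": "plastic DIP",
    "TA": "tape automated bonding (encapsulated)",
    "TB": "tape automated bonding (bare die)",
    "KGD": "known good die", "PDB": "304-pin plastic quad flatpack",
}

# Try longer package codes first so that "JD" is not read as "J".
_PACKAGE_ALT = "|".join(sorted(PACKAGES, key=len, reverse=True))
_PART_RE = re.compile(
    r"^(?P<prefix>" + "|".join(PREFIXES) + r")"
    r"(?P<family>320)(?P<technology>[A-Z])(?P<device>\d{2})"
    r"(?P<package>" + _PACKAGE_ALT + r")"
    r"(?P<temp>[LMS])?"      # temperature range; absent on some parts
    r"(?P<speed>\d{2})?$"    # assumed speed grade in MHz; absent on some parts
)


def parse_part_number(part):
    """Split a part number such as 'SMJ320C40HFHM40' into named fields."""
    match = _PART_RE.match(part)
    if match is None:
        raise ValueError(f"unrecognized part number: {part!r}")
    fields = match.groupdict()
    fields["package_description"] = PACKAGES[fields["package"]]
    fields["speed_mhz"] = int(fields.pop("speed")) if fields["speed"] else None
    return fields


for name in ("TMS320C40GFL", "TMS320C44PDB50", "SMJ320C40HFHM40"):
    print(name, parse_part_number(name))
```

Parts with no speed suffix, such as TMS320C40GFL, return `None` for the speed; Table 10-3 lists that device at 50 MHz.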


Table 10-4. Development Support Tools Part Numbers


Development Tool                                             Part Number       Platform

C Compiler/Assembler/Linker                                  TMDS3243855-02    PC (DOS, OS/2)
C Compiler/Assembler/Linker                                  TMDS3243255-08    VAX (VMS)
C Compiler/Assembler/Linker                                  TMDS3243555-08    SPARC (Sun OS)
Assembler/Linker                                             TMDS3243850-02    PC (DOS)
Simulator (C language)                                       TMDS3244851-02    PC (DOS, Windows)
Simulator (C language)                                       TMDS3244551-09    SPARC (Sun OS)
Tartan Floating Point Library                                320FLO-PC-C40     PC (DOS)
Tartan Floating Point Library                                320FLO-SUN-C40    SPARC (Sun OS)
Digital Filter Design Package                                DFDP              PC (DOS)
C Source Debugger Conversion Software                        TMDS3240140       PC (XDS510)
C Source Debugger Conversion Software                        TMDS3240640       Sun (XDS510WS)
Emulation Porting Kit†                                       TMDX3240040T
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker             TAR-CCM-PC        PC (DOS)
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker             TAR-CCM-SP        SPARC
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker/Simulator   TAR-SIM-PC        PC (DOS)
’C3x/’C4x Tartan C/C++ Compiler/Assembler/Linker/Simulator   TAR-SIM-SP        SPARC
’C3x/’C4x Tartan C/C++ XDS510 Debugger                       TAR-DEG-XDS-PC    PC (DOS, Windows)
’C3x/’C4x Tartan C/C++ XDS510 Debugger                       TAR-DEG-XDS-SP    SPARC (Sun OS)
XDS510 Emulator§                                             TMDS3260140       PC (DOS, OS/2, Windows)
XDS510WS Emulator‡                                           TMDS3260640       Sun (SPARC SCSI)
PC/SPARC JTAG Emulation Cable                                TMDS3080001       XDS510/XDS510WS
Parallel Processing Development System                       TMDX3261040       XDS510/XDS510WS


† Requires licensing agreement.

‡ Includes XDS510WS box, SCSI cable, power supply, and JTAG cable. TMDS3240640 C-source debugger software not
  included.

§ Includes XDS510 board and JTAG cable. TMDS3240140 C-source debugger software not included.


Chapter 11


XDS510 Emulator Design Considerations


This chapter explains the design requirements of the XDS510 emulator with
respect to JTAG designs, and discusses the XDS510 cable (manufacturing
part number 2617698-0001). This cable is identified by a label on the cable pod
marked JTAG 3/5V and supports both standard 3-volt and 5-volt target system
power inputs.


The term JTAG, as used in this book, refers to Tl scan-based emulation, which 
is based on the IEEE 1149.1 standard. 


Topic

11.1 Designing Your Target System’s Emulator Connector (14-Pin Header)
11.8 Mechanical Dimensions for the 14-Pin Emulator Connector


11.1 Designing Your Target System’s Emulator Connector (14-Pin Header) 


JTAG target devices support emulation through a dedicated emulation port. 
This port is a superset of the IEEE 1149.1 standard and is accessed by the 
emulator. To communicate with the emulator, your target system must have a 
14-pin header (two rows of seven pins) with the connections that are shown 
in Figure 11-1. Table 11-1 describes the emulation signals. 


Figure 11-1. 14-Pin Header Signals and Header Dimensions


Signal      Pin   Pin   Signal

TMS           1     2   TRST
TDI           3     4   GND
PD (Vcc)      5     6   no pin (key)†
TDO           7     8   GND
TCK_RET       9    10   GND
TCK          11    12   GND
EMU0         13    14   EMU1

Header dimensions:
Pin-to-pin spacing, 0.100 in. (X,Y)
Pin width, 0.025-in. square post
Pin length, 0.235-in. nominal


† While the corresponding female position on the cable connector is plugged to prevent improper
  connection, the cable lead for pin 6 is present in the cable and is grounded, as shown in the
  schematics and wiring diagrams in this document.


Table 11-1. 14-Pin Header Signal Descriptions


                                                            Emulator†   Target†
Signal     Description                                      State       State

TMS        Test mode select                                 O           I
TDI        Test data input                                  O           I
TDO        Test data output                                 I           O
TCK        Test clock. TCK is a 10.368-MHz clock            O           I
           source from the emulation cable pod. This
           signal can be used to drive the system test
           clock.
TRST‡      Test reset                                       O           I
EMU0       Emulation pin 0                                  I           I/O
EMU1       Emulation pin 1                                  I           I/O
PD(Vcc)    Presence detect. Indicates that the              I           O
           emulation cable is connected and that the
           target is powered up. PD should be tied to
           Vcc in the target system.
TCK_RET    Test clock return. Test clock input to the       I           O
           emulator. May be a buffered or unbuffered
           version of TCK.
GND        Ground


† I = input; O = output

‡ Do not use pullup resistors on TRST: it has an internal pulldown device. In a low-noise
  environment, TRST can be left floating. In a high-noise environment, an additional pulldown
  resistor may be needed. (The size of this resistor should be based on electrical current
  considerations.)


Although you can use other headers, recommended parts include:


straight header, unshrouded    DuPont Connector Systems
                               part numbers: 65610-114
                                             65611-114
                                             67996-114
                                             67997-114


11.2 Bus Protocol

The IEEE 1149.1 specification covers the requirements for the test access port 
(TAP) bus slave devices and provides certain rules, summarized as follows: 


❏  The TMS/TDI inputs are sampled on the rising edge of the TCK signal of
   the device.


❏  The TDO output is clocked from the falling edge of the TCK signal of the
   device.


When these devices are daisy-chained together, the TDO of one device has 
approximately a half TCK cycle setup to the next device’s TDI signal. This type 
of timing scheme minimizes race conditions that would occur if both TDO and 
TDI were timed from the same TCK edge. The penalty for this timing scheme 
is a reduced TCK frequency. 


The IEEE 1149.1 specification does not provide rules for bus master (emula- 
tor) devices. Instead, it states that it expects a bus master to provide bus slave 
compatible timings. The XDS510 provides timings that meet the bus slave 
rules. 


11.3 IEEE 1149.1 Standard 


For more information concerning the IEEE 1149.1 standard, contact IEEE 
Customer Service: 


Address: IEEE Customer Service 
445 Hoes Lane, PO Box 1331 
Piscataway, NJ 08855-1331 


Phone: (800) 678—IEEE in the US and Canada 
(908) 981-1393 outside the US and Canada 


FAX: (908) 981-9667 Telex: 833233 
11.4 JTAG Emulator Cable Pod Logic 


Figure 11-2 shows a portion of the emulator cable pod. These are the function- 
al features of the pod: 


❏  Signals TDO and TCK_RET can be parallel-terminated inside the pod if
   required by the application. By default, these signals are not terminated.


❏  Signal TCK is driven with a 74LVT240 device. Because of the high-current
   drive (32 mA IOL/IOH), this signal can be parallel-terminated. If TCK is tied
   to TCK_RET, then you can use the parallel terminator in the pod.


❏  Signals TMS and TDI can be generated from the falling edge of TCK_RET,
   according to the IEEE 1149.1 bus slave device timing rules.


❏  Signals TMS and TDI are series-terminated to reduce signal reflections.


❏  A 10.368-MHz test clock source is provided. You may also provide your
   own test clock for greater flexibility.


Figure 11-2. JTAG Emulator Cable Pod Interface


[Figure: schematic of the cable pod interface. Labeled parts: 74F175, 74LVT240, 74AS1034, 74AS1004, and TL7705A (RESIN input) devices; a 10.368-MHz clock source; jumpers JP1 and JP2; +5 V terminations with 270 Ω, 180 Ω, and 100 Ω resistors. Header signals: TMS (pin 1), TRST (pin 2), TDI (pin 3), PD(Vcc) (pin 5), TDO (pin 7), TCK_RET (pin 9)†, TCK (pin 11)†, EMU0 (pin 13), EMU1 (pin 14), and GND (pins 4, 6, 8, 10, 12).]


† The emulator pod uses TCK_RET as its clock source for internal synchronization. TCK is provided
  as an optional target system test clock source.


11.5 JTAG Emulator Cable Pod Signal Timing 


Figure 11-3 shows the signal timings for the emulator cable pod. Table 11-2 
defines the timing parameters. These timing parameters are calculated from 
values specified in the standard data sheets for the emulator and cable pod 
and are for reference only. Texas Instruments does not test or guarantee these 
timings. 


The emulator pod uses TCK_RET as its clock source for internal synchroni- 
zation. TCK is provided as an optional target system test clock source. 


Figure 11-3. JTAG Emulator Cable Pod Timings


[Figure: timing diagram of TCK_RET (1.5-V reference level) and TMS/TDI, annotated with the numbered timing parameters defined in Table 11-2.]

Table 11-2. Emulator Cable Pod Timing Parameters


No.   Reference   Description                                      Min   Max   Units

1     tc(TCK)     TCK_RET period                                   35    200   ns
2     tw(TCKH)    TCK_RET high-pulse duration                      15          ns
3     tw(TCKL)    TCK_RET low-pulse duration                       15          ns
4     td(TMS)     Delay time, TMS/TDI valid from TCK_RET low       6     20    ns
5     tsu(TDO)    TDO setup time to TCK_RET high                   3           ns
6     th(TDO)     TDO hold time from TCK_RET high                  12          ns
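

As a quick sanity check, a candidate test clock can be compared against the period and pulse-width limits in Table 11-2. The following is a minimal sketch: the function name, the dictionary layout, and the duty-cycle parameter are illustrative assumptions, and the limits are for reference only, as noted above.

```python
# Limits transcribed from Table 11-2 (nanoseconds).
PERIOD_MIN_NS = 35.0
PERIOD_MAX_NS = 200.0
PULSE_MIN_NS = 15.0


def check_tck_ret(freq_mhz, duty_high=0.5):
    """Check a TCK_RET frequency and duty cycle against Table 11-2."""
    period_ns = 1e3 / freq_mhz
    high_ns = period_ns * duty_high
    low_ns = period_ns - high_ns
    ok = (PERIOD_MIN_NS <= period_ns <= PERIOD_MAX_NS
          and high_ns >= PULSE_MIN_NS
          and low_ns >= PULSE_MIN_NS)
    return {"period_ns": period_ns, "high_ns": high_ns,
            "low_ns": low_ns, "ok": ok}


# The pod's own 10.368-MHz clock, assuming a 40/60 duty cycle.
print(check_tck_ret(10.368, duty_high=0.4))

# A 40-MHz clock has a 25-ns period, shorter than the 35-ns minimum.
print(check_tck_ret(40.0))
```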
11.6 Emulation Timing Calculations 


The following examples help you calculate emulation timings in your system. 
For actual target timing parameters, see the appropriate device data sheets. 


Assumptions:


tsu(TTMS)      Target TMS/TDI setup to TCK high                 10 ns
td(TTDO)       Target TDO delay from TCK low                    15 ns
td(bufmax)     Target buffer delay, maximum                     10 ns
td(bufmin)     Target buffer delay, minimum                     1 ns
t(bufskew)     Target buffer skew between two devices           1.35 ns
               in the same package:
               [td(bufmax) - td(bufmin)] x 0.15
t(TCKfactor)   Assume a 40/60 duty cycle clock                  0.4 (40%)

Given in Table 11-2:

td(TMSmax)     Emulator TMS/TDI delay from TCK_RET              20 ns
               low, maximum
tsu(TDOmin)    TDO setup time to emulator TCK_RET               3 ns
               high, minimum


There are two key timing paths to consider in the emulation design: 


❏  The TCK_RET-to-TMS/TDI path, called tpd(TCK_RET-TMS/TDI)
❏  The TCK_RET-to-TDO path, called tpd(TCK_RET-TDO)


Of the following two cases, the worst-case path delay is calculated to deter- 
mine the maximum system test clock frequency. 


Case 1: Single processor, direct connection, TMS/TDI timed from TCK_RET low.


$$
t_{pd(TCK\_RET-TMS/TDI)} = \frac{t_{d(TMSmax)} + t_{su(TTMS)}}{t_{(TCKfactor)}}
= \frac{20\ \text{ns} + 10\ \text{ns}}{0.4}
= 75\ \text{ns}\ (13.3\ \text{MHz})
$$


$$
t_{pd(TCK\_RET-TDO)} = \frac{t_{d(TTDO)} + t_{su(TDOmin)}}{t_{(TCKfactor)}}
= \frac{15\ \text{ns} + 3\ \text{ns}}{0.4}
= 45\ \text{ns}\ (22.2\ \text{MHz})
$$


In this case, the TCK_RET-to-TMS/TDI path is the limiting factor.


Emulation Timing Calculations 
Case 2: Single/multiprocessor, TMS/TDI/TCK buffered input, TDO buffered output, 
TMS/TDI timed from TCK_RET low. 


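$$
t_{\mathrm{pd(TCK\_RET-TMS/TDI)}}
  = \frac{t_{\mathrm{d(TMSmax)}} + t_{\mathrm{su(TTMS)}} + t_{\mathrm{(bufskew)}}}{t_{\mathrm{(TCKfactor)}}}
  = \frac{20\,\mathrm{ns} + 10\,\mathrm{ns} + 1.35\,\mathrm{ns}}{0.4}
  = 78.4\,\mathrm{ns} \quad (12.7\,\mathrm{MHz})
$$

$$
t_{\mathrm{pd(TCK\_RET-TDO)}}
  = \frac{t_{\mathrm{d(TTDO)}} + t_{\mathrm{su(TDOmin)}} + t_{\mathrm{d(bufmax)}}}{t_{\mathrm{(TCKfactor)}}}
  = \frac{15\,\mathrm{ns} + 3\,\mathrm{ns} + 10\,\mathrm{ns}}{0.4}
  = 70\,\mathrm{ns} \quad (14.3\,\mathrm{MHz})
$$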


In this case, the TCK_RET-to-TMS/TDI path is the limiting factor. 
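
The two cases above can be checked numerically. The following Python sketch is illustrative only: the function and variable names are assumptions made for this example, not part of any TI tool, and the delay values are the example assumptions listed in this section rather than data sheet figures.

```python
# Illustrative sketch: all names are hypothetical.
# Delay values (ns) are the example assumptions from this section.
T_SU_TTMS = 10.0      # target TMS/TDI setup to TCK high
T_D_TTDO = 15.0       # target TDO delay from TCK low
T_D_BUFMAX = 10.0     # target buffer delay, maximum
T_D_BUFMIN = 1.0      # target buffer delay, minimum
T_BUFSKEW = (T_D_BUFMAX - T_D_BUFMIN) * 0.15   # 1.35 ns
T_TCKFACTOR = 0.4     # 40/60 duty cycle clock
T_D_TMSMAX = 20.0     # emulator TMS/TDI delay from TCK_RET low, maximum
T_SU_TDOMIN = 3.0     # TDO setup to emulator TCK_RET high, minimum


def path_period_ns(*delays_ns, tck_factor=T_TCKFACTOR):
    """Minimum TCK period (ns) for a path: sum of delays / duty factor."""
    return sum(delays_ns) / tck_factor


def to_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


cases = {
    "case1": {
        "TMS/TDI": path_period_ns(T_D_TMSMAX, T_SU_TTMS),
        "TDO": path_period_ns(T_D_TTDO, T_SU_TDOMIN),
    },
    "case2": {
        "TMS/TDI": path_period_ns(T_D_TMSMAX, T_SU_TTMS, T_BUFSKEW),
        "TDO": path_period_ns(T_D_TTDO, T_SU_TDOMIN, T_D_BUFMAX),
    },
}

for name, paths in cases.items():
    limiting = max(paths, key=paths.get)   # the longest period limits the clock
    for path, period in paths.items():
        print(f"{name} {path}: {period:.1f} ns ({to_mhz(period):.1f} MHz)")
    print(f"{name} limiting path: {limiting}")
```

Substituting the timing parameters from your own device data sheets into the constants gives the corresponding limit for your system.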


In a multiprocessor application, it is necessary to ensure that the EMU0-1
lines can go from a logic low level to a logic high level in less than 10 µs.
This can be calculated as follows:

$$
t_r = 5\,(R_{\mathrm{pullup}} \times N_{\mathrm{devices}} \times C_{\mathrm{load\_per\_device}})
    = 5\,(4.7\,\mathrm{k\Omega} \times 16 \times 15\,\mathrm{pF})
    = 5.64\,\mu\mathrm{s}
$$
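
The same rise-time estimate can be evaluated for other pullup values or device counts. This is a minimal sketch; the function name and argument names are assumptions made for illustration, and the inputs are the example values used above.

```python
# Illustrative sketch: names are hypothetical.
def emu_rise_time_s(r_pullup_ohm, n_devices, c_load_per_device_f):
    """Rise-time estimate t_r = 5 * R * N * C, in seconds."""
    return 5.0 * r_pullup_ohm * n_devices * c_load_per_device_f


# Example values from the text: 4.7 kOhm pullup, 16 devices, 15 pF each.
t_r = emu_rise_time_s(4.7e3, 16, 15e-12)
meets_limit = t_r < 10e-6          # requirement: less than 10 us

print(f"t_r = {t_r * 1e6:.2f} us, meets 10-us limit: {meets_limit}")
```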




11.7 Connections Between the Emulator and the Target System 


It is extremely important to provide high-quality signals between the emulator 
and the JTAG target system. Depending upon the situation, you must supply 
the correct signal buffering, test clock inputs, and multiple processor intercon- 
nections to ensure proper emulator and target system operation. 


Signals applied to the EMU0 and EMU1 pins on the JTAG target device can
be either input or output (I/O). In general, these two pins are used as both input
and output in multiprocessor systems to handle global run/stop operations.
EMU0 and EMU1 signals are applied only as inputs to the XDS510 emulator
header.


11.7.1 Buffering Signals 


If the distance between the emulation header and the JTAG target device is 
greater than six inches, the emulation signals must be buffered. If the distance 
is less than six inches, no buffering is necessary. The following illustrations 
depict these two situations. 


- No signal buffering. In this situation, the distance between the header
  and the JTAG target device should be no more than six inches.


[Figure: unbuffered connection, 6 inches or less, between the JTAG device
and the emulator header. Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO,
TCK, and TCK_RET, with PD, Vcc, and GND connections at the header.]


The EMU0 and EMU1 signals must have pullup resistors connected to Vcc to
provide a signal rise time of less than 10 µs. A 4.7-kΩ resistor is suggested for
most applications.




- Buffered transmission signals. In this situation, the distance between
  the emulation header and the processor is greater than six inches.
  Emulation signals TMS, TDI, TDO, and TCK_RET are buffered through the
  same package.


[Figure: buffered connection, greater than 6 inches, between the JTAG device
and the emulator header. Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO,
TCK, and TCK_RET, with PD, Vcc, and GND connections at the header.]


  - The EMU0 and EMU1 signals must have pullup resistors connected to
    Vcc to provide a signal rise time of less than 10 µs. A 4.7-kΩ
    resistor is suggested for most applications.

  - The input buffers for TMS and TDI should have pullup resistors
    connected to Vcc to hold these signals at a known value when the
    emulator is not connected. A resistor value of 4.7 kΩ or greater is
    suggested.

  - To have high-quality signals (especially the processor TCK and the
    emulator TCK_RET signals), you may have to employ special care
    when routing the PWB trace. You also may have to use termination
    resistors to match the trace impedance. The emulator pod provides
    optional internal parallel terminators on TCK_RET and TDO. TMS
    and TDI provide fixed series termination.

  - Since TRST is an asynchronous signal, it should be buffered as
    needed to ensure sufficient current to all target devices.


11.7.2 Using a Target-System Clock 


Figure 11-4 shows an application with the system test clock generated in the 
target system. In this application, the TCK signal is left unconnected. 


Figure 11-4. Target-System-Generated Test Clock 


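[Figure: buffered connection, greater than 6 inches, with the test clock
generated in the target system. Signals shown: EMU0, EMU1, TRST, TMS, TDI,
TDO, TCK (NC at the emulator header), TCK_RET, and the system test clock,
with PD, Vcc, and GND connections at the header.]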


Note: When the TMS/TDI lines are buffered, pullup resistors should be used to hold the buffer
inputs at a known level when the emulator cable is not connected.


There are two benefits to having the target system generate the test clock: 


- The emulator provides only a single 10.368-MHz test clock. If you allow
  the target system to generate your test clock, you can set the frequency
  to match your system requirements.

- In some cases, you may have other devices in your system that require
  a test clock when the emulator is not connected. The system test clock
  also serves this purpose.
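
Whether the emulator's fixed clock is adequate, or a slower target-generated clock is needed, can be checked against the worst-case path period. The sketch below is a minimal illustration: the function and constant names are assumptions, the 35-ns and 200-ns limits are the TCK_RET period limits from Table 11-2, and the 75-ns and 120-ns path periods are taken from the worked examples in Sections 11.6 and 11.9.2.

```python
# Illustrative sketch: names are hypothetical.
EMULATOR_TCK_MHZ = 10.368          # fixed emulator test clock
TCK_RET_PERIOD_MIN_NS = 35.0       # Table 11-2, tc(TCK) minimum
TCK_RET_PERIOD_MAX_NS = 200.0      # Table 11-2, tc(TCK) maximum


def tck_is_usable(freq_mhz, worst_path_period_ns):
    """True if the clock period covers the worst-case path period and
    lies within the TCK_RET period limits."""
    period_ns = 1000.0 / freq_mhz
    return (period_ns >= worst_path_period_ns
            and TCK_RET_PERIOD_MIN_NS <= period_ns <= TCK_RET_PERIOD_MAX_NS)


# 75-ns worst-case path: the emulator clock (about 96.5 ns) is slow enough.
print(tck_is_usable(EMULATOR_TCK_MHZ, 75.0))    # True
# 120-ns worst-case path: the emulator clock is too fast ...
print(tck_is_usable(EMULATOR_TCK_MHZ, 120.0))   # False
# ... but an 8-MHz target-generated clock (125 ns) would fit.
print(tck_is_usable(8.0, 120.0))                # True
```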


11.7.3 Configuring Multiple Processors 


Figure 11-5 shows a typical daisy-chained multiprocessor configuration, 
which meets the minimum requirements of the IEEE 1149.1 specification. The 
emulation signals in this example are buffered to isolate the processors from 
the emulator and provide adequate signal drive for the target system. One of 
the benefits of this type of interface is that you can generally slow down the test 
clock to eliminate timing problems. You should follow these guidelines for 
multiprocessor support: 


- The processor TMS, TDI, TDO, and TCK signals should be buffered
  through the same physical package for better control of timing skew.

- The input buffers for TMS, TDI, and TCK should have pullup resistors
  connected to Vcc to hold these signals at a known value when the
  emulator is not connected. A resistor value of 4.7 kΩ or greater is
  suggested.

- Buffering EMU0 and EMU1 is optional but highly recommended to provide
  isolation. These are not critical signals and do not have to be buffered
  through the same physical package as TMS, TCK, TDI, and TDO. Unbuffered
  and buffered signals are shown in this section (page 11-8 and page
  11-9).


Figure 11-5. Multiprocessor Connections 


[Figure: two daisy-chained JTAG devices connected to the emulator header.
Signals shown: EMU0, EMU1, TRST, TMS, TDI, TDO, TCK, and TCK_RET, with PD
and Vcc connections at the header.]
11.8 Mechanical Dimensions for the 14-Pin Emulator Connector


The JTAG emulator target cable consists of a 3-foot section of jacketed cable, 
an active cable pod, and a short section of jacketed cable that connects to the 
target system. The overall cable length is approximately 3 feet 10 inches. 
Figure 11-6 and Figure 11-7 (page 11-13) show the mechanical dimensions
for the target cable pod and short cable. Note that the pin-to-pin spacing on 
the connector is 0.100 inches in both the X and Y planes. The cable pod box 
is nonconductive plastic with four recessed metal screws. 


Figure 11-6. Pod/Connector Dimensions 


[Figure: emulator cable pod, short jacketed cable, and connector, with
dimension callouts 2.70, 4.50, 9.50, and 0.90. For the connector, refer to
Figure 11-7.]


Note: All dimensions are in inches and are nominal dimensions, unless otherwise specified. 


Figure 11-7. 14-Pin Connector Dimensions 


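[Figure: 14-pin connector, side view and front view, with the cable exit
marked. Dimension callouts: 0.66, 0.87, and 0.100 pin spacing in X and Y.
The key is at pin 6. One row carries pins 1, 3, 5, 7, 9, 11, 13 and the
other row carries pins 2, 4, 6, 8, 10, 12, 14.]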


Note: All dimensions are in inches and are nominal dimensions, unless otherwise specified. 
11.9 Emulation Design Considerations


This section describes the scan path linker (SPL), which can simultaneously 
add all four secondary JTAG scan paths to the main scan path. It also de- 
scribes how to use the emulation pins and configure multiple processors. 


11.9.1 Using Scan Path Linkers 
You can use the TI ACT8997 scan path linker (SPL) to divide the JTAG
emulation scan path into smaller, logically connected groups of 4 to 16
devices. As described in the Advanced Logic and Bus Interface Logic Data
Book (literature number SCYD001), the SPL is compatible with the JTAG
emulation scanning. The SPL is capable of adding any combination of its four
secondary scan paths into the main scan path.


A system of multiple, secondary JTAG scan paths has better fault tolerance 
and isolation than a single scan path. Since an SPL has the capability of adding 
all secondary scan paths to the main scan path simultaneously, it can support 
global emulation operations, such as starting or stopping a selected group of 
processors. 


TI emulators do not support the nesting of SPLs (for example, an SPL
connected to the secondary scan path of another SPL). However, you can 
have multiple SPLs on the main scan path. 


Although the ACT8999 scan path selector is similar to the SPL, it can add only 
one of its secondary scan paths at a time to the main JTAG scan path. Thus, 
global emulation operations are not assured with the scan path selector. For 
this reason, scan path selectors are not supported. 


You can insert an SPL on a backplane so that you can add up to four device 
boards to the system without the jumper wiring required with nonbackplane 
devices. You connect an SPL to the main JTAG scan path in the same way you 
connect any other device. Figure 11-8 shows you how to connect a secondary
scan path to an SPL. 


Figure 11-8. Connecting a Secondary JTAG Scan Path to an SPL 




[Figure: SPL with main-scan-path pins TDI, TMS, TCK, TRST, and TDO, and
secondary-scan-path pins DTCK, DTDO0 to DTDO3, DTMS0 to DTMS3, and DTDI0 to
DTDI3, connected to JTAG devices 0 through n (pins TDI, TMS, TCK, TRST,
and TDO).]


The TRST signal from the main scan path drives all devices, even those on 
the secondary scan paths of the SPL. The TCK signal on each target device 
on the secondary scan path of an SPL is driven by the SPL’s DTCK signal. The 
TMS signal on each device on the secondary scan path is driven by the respec- 
tive DTMS signals on the SPL. 


DTDO on the SPL is connected to the TDI signal of the first device on the sec- 
ondary scan path. DTDI on the SPL is connected to the TDO signal of the last 
device in the secondary scan path. Within each secondary scan path, the TDI 
signal of a device is connected to the TDO signal of the device before it. If the 
SPL is on a backplane, its secondary JTAG scan paths are on add-on boards; 
if signal degradation is a problem, you may need to buffer both the TRST and 
DTCK signals. Although less likely, you may also need to buffer the DTMSn 
signals for the same reasons. 
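
The connection rules above can be written out as a small netlist generator. This is a sketch under stated assumptions: the function name, the "SPL." and "MAIN." prefixes, and the device names are hypothetical labels chosen for this example. Only the connection rules themselves come from the text.

```python
# Illustrative sketch: function, prefix, and device names are hypothetical.
def secondary_scan_path_nets(path_index, devices):
    """Return (driver, receiver) pairs for one SPL secondary scan path.

    devices is the ordered list of device names on the path, first to last.
    """
    if not devices:
        return []
    n = path_index
    nets = []
    # Serial data: SPL DTDOn -> first TDI, each TDO -> next TDI,
    # last TDO -> SPL DTDIn.
    nets.append((f"SPL.DTDO{n}", f"{devices[0]}.TDI"))
    for prev_dev, next_dev in zip(devices, devices[1:]):
        nets.append((f"{prev_dev}.TDO", f"{next_dev}.TDI"))
    nets.append((f"{devices[-1]}.TDO", f"SPL.DTDI{n}"))
    # Signals that fan out to every device on the path.
    for dev in devices:
        nets.append(("SPL.DTCK", f"{dev}.TCK"))       # clock from SPL DTCK
        nets.append((f"SPL.DTMS{n}", f"{dev}.TMS"))   # mode select from DTMSn
        nets.append(("MAIN.TRST", f"{dev}.TRST"))     # reset from main scan path
    return nets


for driver, receiver in secondary_scan_path_nets(0, ["DSP_A", "DSP_B"]):
    print(f"{driver} -> {receiver}")
```

Listing the nets this way makes it easy to confirm that every device on a secondary path receives DTCK, the matching DTMSn, and the main-path TRST.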
11.9.2 Emulation Timing Calculations for SPL


The following examples help you to calculate the emulation timings in the SPL 
secondary scan path of your system. For actual target timing parameters, see 
the appropriate device data sheets. 


Assumptions:

tsu(TTMS)      Target TMS/TDI setup to TCK high                  10 ns
td(TTDO)       Target TDO delay from TCK low                     15 ns
td(bufmax)     Target buffer delay, maximum                      10 ns
td(bufmin)     Target buffer delay, minimum                      1 ns
t(bufskew)     Target buffer skew between two devices in the     1.35 ns
               same package: [td(bufmax) - td(bufmin)] x 0.15
t(TCKfactor)   Assume a 40/60 duty cycle clock                   0.4 (40%)

Given in the SPL data sheet:

td(DTMSmax)    SPL DTMS/DTDO delay from TCK low, maximum         31 ns
tsu(DTDLmin)   DTDI setup time to SPL TCK high, minimum          7 ns
td(DTCKHmin)   SPL DTCK delay from TCK high, minimum             2 ns
td(DTCKLmax)   SPL DTCK delay from TCK low, maximum              16 ns


There are two key timing paths to consider in the emulation design: 


- The TCK-to-DTMS/DTDO path, called tpd(TCK-DTMS)
- The TCK-to-DTDI path, called tpd(TCK-DTDI)

Of the following two cases, the worst-case path delay is calculated to
determine the maximum system test clock frequency.


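Case 1: Single processor, direct connection, DTMS/DTDO timed from TCK low.

$$
t_{\mathrm{pd(TCK-DTMS)}}
  = \frac{t_{\mathrm{d(DTMSmax)}} + t_{\mathrm{d(DTCKHmin)}} + t_{\mathrm{su(TTMS)}}}{t_{\mathrm{(TCKfactor)}}}
  = \frac{31\,\mathrm{ns} + 2\,\mathrm{ns} + 10\,\mathrm{ns}}{0.4}
  = 107.5\,\mathrm{ns} \quad (9.3\,\mathrm{MHz})
$$

$$
t_{\mathrm{pd(TCK-DTDI)}}
  = \frac{t_{\mathrm{d(TTDO)}} + t_{\mathrm{d(DTCKLmax)}} + t_{\mathrm{su(DTDLmin)}}}{t_{\mathrm{(TCKfactor)}}}
  = \frac{15\,\mathrm{ns} + 16\,\mathrm{ns} + 7\,\mathrm{ns}}{0.4}
  = 95\,\mathrm{ns} \quad (10.5\,\mathrm{MHz})
$$

In this case, the TCK-to-DTMS/DTDO path is the limiting factor.
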
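Case 2: Single/multiprocessor, DTMS/DTDO/TCK buffered input, DTDI buffered
output, DTMS/DTDO timed from TCK low.

$$
t_{\mathrm{pd(TCK-DTMS)}}
  = \frac{t_{\mathrm{d(DTMSmax)}} + t_{\mathrm{d(DTCKHmin)}} + t_{\mathrm{su(TTMS)}} + t_{\mathrm{(bufskew)}}}{t_{\mathrm{(TCKfactor)}}}
  = \frac{31\,\mathrm{ns} + 2\,\mathrm{ns} + 10\,\mathrm{ns} + 1.35\,\mathrm{ns}}{0.4}
  = 110.9\,\mathrm{ns} \quad (9.0\,\mathrm{MHz})
$$

$$
t_{\mathrm{pd(TCK-DTDI)}}
  = \frac{t_{\mathrm{d(TTDO)}} + t_{\mathrm{d(DTCKLmax)}} + t_{\mathrm{su(DTDLmin)}} + t_{\mathrm{d(bufmax)}}}{t_{\mathrm{(TCKfactor)}}}
  = \frac{15\,\mathrm{ns} + 16\,\mathrm{ns} + 7\,\mathrm{ns} + 10\,\mathrm{ns}}{0.4}
  = 120\,\mathrm{ns} \quad (8.3\,\mathrm{MHz})
$$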


In this case, the TCK-to-DTDI path is the limiting factor. 
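
As a cross-check of both SPL cases, the sketch below recomputes each path period from the listed delays. It is illustrative only: the dictionary keys and the function name are assumptions made for this example. The buffered DTDI path is evaluated with td(DTCKLmax) = 16 ns and td(bufmax) = 10 ns, which reproduces the 120-ns result. The Case 1 DTDI path evaluates to 38 ns / 0.4 = 95 ns, consistent with its 10.5-MHz figure.

```python
# Illustrative sketch: names are hypothetical. Values (ns) are this
# section's example assumptions and SPL figures, not data sheet limits.
DELAYS_NS = {
    "tsu(TTMS)": 10.0,      # target TMS/TDI setup to TCK high
    "td(TTDO)": 15.0,       # target TDO delay from TCK low
    "td(bufmax)": 10.0,     # target buffer delay, maximum
    "t(bufskew)": 1.35,     # buffer skew, (10 - 1) x 0.15
    "td(DTMSmax)": 31.0,    # SPL DTMS/DTDO delay from TCK low, maximum
    "tsu(DTDLmin)": 7.0,    # DTDI setup to SPL TCK high, minimum
    "td(DTCKHmin)": 2.0,    # SPL DTCK delay from TCK high, minimum
    "td(DTCKLmax)": 16.0,   # SPL DTCK delay from TCK low, maximum
}
TCK_FACTOR = 0.4            # 40/60 duty cycle clock

# Terms that add up along each path, per case.
PATHS = {
    ("case1", "TCK-DTMS"): ["td(DTMSmax)", "td(DTCKHmin)", "tsu(TTMS)"],
    ("case1", "TCK-DTDI"): ["td(TTDO)", "td(DTCKLmax)", "tsu(DTDLmin)"],
    ("case2", "TCK-DTMS"): ["td(DTMSmax)", "td(DTCKHmin)", "tsu(TTMS)",
                            "t(bufskew)"],
    ("case2", "TCK-DTDI"): ["td(TTDO)", "td(DTCKLmax)", "tsu(DTDLmin)",
                            "td(bufmax)"],
}


def period_ns(terms):
    """Minimum TCK period (ns): sum of the named delays / duty factor."""
    return sum(DELAYS_NS[t] for t in terms) / TCK_FACTOR


results = {key: period_ns(terms) for key, terms in PATHS.items()}
for (case, path), t in results.items():
    print(f"{case} {path}: {t:.1f} ns ({1000.0 / t:.1f} MHz)")
```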
11.9.3 Using Emulation Pins

The EMU0/1 pins of TI devices are bidirectional, three-state output pins. When
in an inactive state, these pins are at high impedance. When the pins are
active, they function in one of the two following output modes:


- Signal Event
  The EMU0/1 pins can be configured via software to signal internal events.
  In this mode, driving one of these pins low can cause devices to signal
  such events. To enable this operation, the EMU0/1 pins function as
  open-collector sources. External devices such as logic analyzers can also
  be connected to the EMU0/1 signals in this manner. If such an external
  source is used, it must also be connected via an open-collector source.

- External Count
  The EMU0/1 pins can be configured via software as totem-pole outputs
  for driving an external counter. These devices can be damaged if the
  output of more than one device is configured for totem-pole operation.
  The emulation software detects and prevents this condition. However, the
  emulation software has no control over external sources on the EMU0/1
  signal. Therefore, all external sources must be inactive when any device
  is in the external count mode.


TI devices can be configured by software to halt processing if their EMU0/1
pins are driven low. This feature, in combination with the use of the signal event
output mode, allows one TI device to halt all other TI devices on a given event
for system-level debugging.


If you route the EMU0/1 signals between boards, they require special handling 
because these signals are more complex than normal emulation signals. 
Figure 11-9 shows an example configuration that allows any processor in the 
system to stop any other processor in the system. Do not tie the EMU0/1 pins 
of more than 16 processors together in a single group without using buffers. 
Buffers provide the crisp signals that are required during a RUNB (run bench- 
mark) debugger command or when the external analysis counter feature is 
used. 


Figure 11-9. EMU0/1 Configuration

Notes: 1) The low time on EMUx-IN should be at least one TCK cycle and less than 10 µs. Software will set the EMUx-OUT
pin to a high state.

2) To enable the open-collector driver and pullup resistor on EMU1 to provide rising/falling edges of less than 25 ns,
the modification shown in this figure is suggested. Rising edges slower than 25 ns can cause the emulator to detect
false edges during the RUNB command or when the external counter selected from the debugger analysis menu
is used.


These seven important points apply to the circuitry shown in Figure 11-9 and
the timing shown in Figure 11-10:

- Open-collector drivers isolate each board. The EMU0/1 pins are tied
  together on each board.

- At the board edge, the EMU0/1 signals are split to provide IN/OUT. This
  is required to prevent the open-collector drivers from acting as a latch
  that can be set only once.

- The EMU0/1 signals are bused down the backplane. Pullup resistors are
  installed as required.

- The bused EMU0/1 signals go into a PAL® device whose function is to
  generate a low pulse on the EMU0/1-IN signal when a low level is detected
  on the EMU0/1-OUT signal. This pulse must be longer than one TCK
  period to affect the devices, but less than 10 µs to avoid possible
  conflicts or retriggering once the emulation software clears the device’s
  pins.

- During a RUNB debugger command or other external analysis count, the
  EMU0/1 pins on the target device become totem-pole outputs. The EMU1
  pin is a ripple carry-out of the internal counter. EMU0 becomes a
  processor-halted signal. During a RUNB or other external analysis count,
  the EMU0/1-IN signal to all boards must remain in the high (disabled)
  state. You must provide some type of external input (CCNT_ENABLE) to
  the PAL to disable the PAL from driving EMU0/1-IN to a low state.

- If sources other than TI processors (such as logic analyzers) are used to
  drive EMU0/1, their signal lines must be isolated by open-collector
  drivers and be inactive during RUNB and other external analysis counts.

- You must connect the EMU0/1-OUT signals to the emulation header or
  directly to a test bus controller.
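
The pulse-width window for the PAL output can be expressed as a simple check. This is a minimal sketch, not a PAL design: the function names are assumptions, the pulse is assumed to be timed by counting whole TCK cycles, and the lower bound is treated as strictly longer than one TCK period.

```python
import math

# Illustrative sketch: names are hypothetical.
MAX_LOW_TIME_NS = 10_000.0     # low pulse must be shorter than 10 us


def emu_in_pulse_ok(pulse_ns, tck_period_ns):
    """True if the EMU0/1-IN low pulse is longer than one TCK period
    and shorter than 10 us."""
    return tck_period_ns < pulse_ns < MAX_LOW_TIME_NS


def pulse_cycle_range(tck_period_ns):
    """Smallest and largest whole numbers of TCK cycles whose duration
    passes emu_in_pulse_ok, or None if no whole count fits."""
    lo = 2                                                  # more than one period
    hi = math.ceil(MAX_LOW_TIME_NS / tck_period_ns) - 1     # less than 10 us
    return (lo, hi) if hi >= lo else None


tck_period_ns = 100.0                                # a 10-MHz test clock
print(emu_in_pulse_ok(200.0, tck_period_ns))         # True
print(emu_in_pulse_ok(100.0, tck_period_ns))         # False: not longer than one period
print(pulse_cycle_range(tck_period_ns))              # (2, 99)
```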
Figure 11-10. Suggested Timings for the EMU0 and EMU1 Signals

[Figure: timing waveforms for the EMU0/1-OUT and EMU0/1-IN signals.]


Figure 11-11. EMU0/1 Configuration With Additional AND Gate to Meet Timing 

Notes: 1) The low time on EMUx-IN should be at least one TCK cycle and less than 10 µs. Software will set the EMUx-OUT
pin to a high state.

2) To enable the open-collector driver and pullup resistor on EMU1 to provide rising/falling edges of less than 25 ns,
the modification shown in this figure is suggested. Rising edges slower than 25 ns can cause the emulator to detect
false edges during the RUNB command or when the external counter selected from the debugger analysis menu
is used.

If it is not important that the devices on one target board are stopped by devices
on another target board via the EMU0/1 signals, then the circuit in Figure 11-12
can be used. In this configuration, the global-stop capability is lost. It is
important not to overload EMU0/1 with more than 16 devices.


Figure 11-12. EMU0/1 Configuration Without Global Stop

[Figure: target boards 1 through m, each with multiple devices sharing an
EMU0/1 line that has a pullup resistor. The EMU0/1 signals go to the
emulator.]


Note: The open-collector driver and pullup resistor on EMU1 must be able to provide rising/falling edges of less than 25 ns. 
Rising edges slower than 25 ns can cause the emulator to detect false edges during the RUNB command or when the 
external counter selected from the debugger analysis menu is used. If this condition cannot be met, then the EMU0/1 
signals from the individual boards should be ANDed together (as shown in Figure 11-11) to produce an EMU0/1 signal for
the emulator. 
11.9.4 Performing Diagnostic Applications


For systems that require built-in diagnostics, it is possible to connect the
emulation scan path directly to a TI ACT8990 test bus controller (TBC) instead
of the emulation header. The TBC is described in the Texas Instruments
Advanced Logic and Bus Interface Logic Data Book (literature number
SCYD001). Figure 11-13 shows the scan path connections of n devices to the
TBC.


Figure 11-13. TBC Emulation Connections for n JTAG Scan Paths 


[Figure: TBC pins TCKI, TDO, TMS0, TMS1, TMS2/EVNT0, TMS3/EVNT1,
TMS4/EVNT2, TMS5/EVNT3, TCKO, TDI0, and TDI1, connected to the JTAG scan
path as described in the text that follows.]


In the system design shown in Figure 11-13, the TBC emulation signals TCKI,
TDO, TMS0, TMS2/EVNT0, TMS3/EVNT1, TMS5/EVNT3, TCKO, and TDI0
are used, and TMS1, TMS4/EVNT2, and TDI1 are not connected. The target
devices’ EMU0 and EMU1 signals are connected to Vcc through pullup
resistors and tied to the TBC’s TMS2/EVNT0 and TMS3/EVNT1 pins, respectively.
The TBC’s TCKI pin is connected to a clock generator. The TCK signal for the
main JTAG scan path is driven by the TBC’s TCKO pin.

On the TBC, the TMS0 pin drives the TMS pins on each device on the main
JTAG scan path. TDO on the TBC connects to TDI on the first device on the
main JTAG scan path. TDI0 on the TBC is connected to the TDO signal of the
last device on the main JTAG scan path. Within the main JTAG scan path, the
TDI signal of a device is connected to the TDO signal of the device before it.
TRST for the devices can be generated either by inverting the TBC’s
TMS5/EVNT3 signal for software control or by logic on the board itself.


Glossary

A0-A30: External address pins for data/program memory or I/O devices.
These pins are on the global bus. See also LA0-LA30.


address: The location of program code or data stored in memory. 


addressing mode: The method by which an instruction interprets its oper- 
ands to acquire the data it needs. 


ALU: See Arithmetic logic unit. 


analog-to-digital (A/D) converter: A successive-approximation converter 
with internal sample-and-hold circuitry used to translate an analog signal 
to a digital signal. 


ARAU: See auxiliary register arithmetic unit. 


arithmetic logic unit (ALU): The part of the CPU that performs arithmetic 
and logic operations. 


auxiliary registers (ARn): A set of registers used primarily in address gen- 
eration. 


auxiliary register arithmetic unit (ARAU): A 16-bit arithmetic logic unit
(ALU) used to calculate indirect addresses using the auxiliary registers
as inputs and outputs.


bit-reversed addressing: Addressing in which several bits of an address 
are reversed in order to speed processing of algorithms, such as Fourier 
transforms. 


BK: See block-size register. 

block-size register: A register used for defining the length of a program
block to be repeated in repeat mode. 


bootloader: A built-in segment of code that transfers code from an external 
memory or from a communication port to RAM at power-up. 


carry bit: A bit in status register ST1 used by the ALU for extended
arithmetic operations and accumulator shifts and rotates. The carry bit can
be tested by conditional instructions.


circular addressing: An addressing mode in which an auxiliary register is 
used to cycle through a range of addresses to create a circular buffer in 
memory. 


context save/restore: A save/restore of system status (status registers,
accumulator, product register, temporary register, hardware stack, and
auxiliary registers, etc.) when the device enters/exits a subroutine such
as an interrupt service routine.


CPU: Central processing unit. The unit that coordinates the functions of a 
processor. 


CPU cycle: The time it takes the CPU to go through one logic phase (during
which internal values are changed) and one latch phase (during which 
the values are held constant). 


cycle: See CPU cycle. 


D0-D31: External data bus pins that transfer data between the processor
and external data/program memory or I/O devices. See also LD0-LD31.


data-address generation logic: Logic circuitry that generates the address- 
es for data memory reads and writes. This circuitry can generate one ad- 
dress per machine cycle. See also program-address generation logic. 


data-page pointer: A seven-bit register used as the seven MSBs in ad- 
dresses generated using direct addressing. 


decode phase: The phase of the pipeline in which the instruction is de- 
coded. 


DIE: See DMA interrupt enable register. 

DMA coprocessor: A peripheral that transfers the contents of memory
locations independently of the processor (except for initialization).


DMA controller: See DMA coprocessor. 


DMA interrupt enable register (DIE): A register (in the CPU register file) 
that controls which interrupts the DMA coprocessor responds to. 


DP: See data-page pointer. 


dual-access RAM: Memory that can be accessed twice in a single clock 
cycle. For example, your code can read from and write to a dual-access 
RAM in one clock cycle. 


external interrupt: A hardware interrupt triggered by a pin. 


extended-precision floating-point format: A 40-bit representation of a 
floating-point number with a 32-bit mantissa and an 8-bit exponent. 


extended-precision register: A 40-bit register used primarily for extended- 
precision floating-point calculations. Floating-point operations use bits 
39-0 of an extended-precision register. Integer operations, however, use 
only bits 31-0. 


FIFO buffer: First-in, first-out buffer. A portion of memory in which data is 
stored and then retrieved in the same order in which it was stored. Thus, 
the first word stored in this buffer is retrieved first. The ’C4x’s
communication ports each have two FIFOs: one for transmit operations and
one for receive operations.


hardware interrupt: An interrupt triggered through physical connections 
with on-chip peripherals or external devices. 


hit: A condition in which, when the processor fetches an instruction, the 
instruction is available in the cache. 

IACK: Interrupt acknowledge signal. An output signal that indicates that an
interrupt has been received and that the program counter is fetching the
interrupt vector that will force the processor into an interrupt service
routine.


IIE: See internal interrupt enable register. 

IIF: See IIOF flag register.


IIOF flag register (IIF): Controls the function (general-purpose I/O or
interrupt) of the four external pins (IIOF0 to IIOF3). It also contains
timer/DMA interrupt flags.


index registers: Two registers (IR0 and IR1) that are used by the ARAU for
indexing an address. 


internal interrupt: A hardware interrupt caused by an on-chip peripheral. 


internal interrupt enable register: A register (in the CPU register file) that 
determines whether or not the CPU will respond to interrupts from the 
communication ports, the timers, and the DMA coprocessor. 


interrupt: A signal sent to the CPU that (when not masked) forces the CPU 
into a subroutine called an interrupt service routine. This signal can be 
triggered by an external device, an on-chip peripheral, or an instruction 
(TRAP, for example). 


interrupt acknowledge (IACK): A signal that indicates that an interrupt has 
been received, and that the program counter is fetching the interrupt vec- 
tor location. 


interrupt vector table (IVT): An ordered list of addresses which each corre- 
spond to an interrupt; when an interrupt occurs and is enabled, the pro- 
cessor executes a branch to the address stored in the corresponding 
location in the interrupt vector table. 


interrupt vector table pointer (IVTP): A register (in the CPU expansion 
register file) that contains the address of the beginning of the interrupt 
vector table. 


ISR: Interrupt service routine. A module of code that is executed in
response to a hardware or software interrupt. 


IVTP: See interrupt vector table pointer. 

LA0-LA30: External address pins for data/program memory or I/O devices.
These pins are on the local bus. See also A0-A30.


LD0-LD31: External data-bus pins that transfer data between the processor
and external data/program memory or I/O devices. See also D0-D31.


LSB: Least significant bit. The lowest order bit in a word. 


machine cycle: See CPU cycle. 


mantissa: A component of a floating-point number consisting of a fraction 
and a sign bit. The mantissa represents a normalized fraction whose 
binary point is shifted by the exponent. 


maskable interrupt: A hardware interrupt that can be enabled or disabled 
through software. 


memory-mapped register: One of the on-chip registers mapped to ad- 
dresses in memory. Some of the memory-mapped registers are mapped 
to data memory, and some are mapped to input/output memory. 


MFLOPS: Millions of floating-point operations per second. A measure of
floating-point processor speed that counts the number of floating-point
operations made per second.


microcomputer mode: A mode in which the on-chip ROM is enabled. This 
mode is selected via the MP/MC pin. See also MP/MC pin; microproces- 
sor mode. 


microprocessor mode: A mode in which the on-chip ROM is disabled. This 
mode is selected via the MP/MC pin. See also MP/MC pin; microcomput- 
er mode. 


MIPS: Million instructions per second.


miss: A condition in which, when the processor fetches an instruction, it is 
not available in the cache. 


MSB: Most significant bit. The highest order bit in a word. 


multiplier: A device that generates the product of two numbers. 


NMI: See Nonmaskable interrupt. 

nonmaskable interrupt (NMI): A hardware interrupt that uses the same
logic as the maskable interrupts, but cannot be masked. It is often used 
as a soft reset. 


overflow flag (OV) bit: A status bit that indicates whether or not an
arithmetic operation has exceeded the capacity of the corresponding register.


PC: See program counter. 


peripheral bus: A bus that the CPU uses to communicate with the DMA
coprocessor, communication ports, and timers.


pipeline: A method of executing instructions in an assembly-line fashion. 


program counter: A register that contains the address of the next instruc- 
tion to be fetched. 


RC: See repeat counter register. 


read/write (R/W) pin: This memory-control signal indicates the direction of 
transfer when communicating to an external device. 


register file: A bank of registers. 


repeat counter register: A register (in the CPU register file) that specifies 
the number of times minus one that a block of code is to be repeated 
when a block repeat is performed. 


repeat mode: A zero-overhead method for repeating the execution of a 
block of code. 


reset: A means to bring the central processing unit (CPU) to a known state 
by setting the registers and control bits to predetermined values and 
signaling execution to fetch the reset vector. 


reset pin: This pin causes the device to reset. 


ROMEN: ROM enable. An external pin that determines whether or not the
on-chip ROM is enabled.


Glossary 


R/W: See read/write pin. 


short-floating-point format: A 16-bit representation of a floating-point 
number with a 12-bit mantissa and a 4-bit exponent. 


short-integer format: A twos-complement 16-bit format for integer data. 
short-unsigned-integer format: A 16-bit unsigned format for integer data. 
sign extend: Fill the high order bits of a number with the sign bit. 


single-access RAM: SARAM. Memory that can be read from or written to
only once in a single CPU cycle. 


single-precision floating-point format: A 32-bit representation of a float- 
ing point number with a 24-bit mantissa and an 8-bit exponent. 


single-precision integer format: A twos-complement 32-bit format for in- 
teger data. 


single-precision unsigned-integer format: A 32-bit unsigned format for 
integer data. 


software interrupt: An interrupt caused by the execution of a TRAP instruc- 
tion. 


split mode: A mode of operation of the DMA coprocessor. This mode allows
one DMA channel to service both the receive and transmit portions of a 
communication port. 


ST: See status register. 


stack: A block of memory reserved for storing and retrieving data on a
first-in, last-out basis. It is usually used for storing return addresses
and for preserving register values.


status register: A register (in the CPU register file) that contains global in- 
formation related to the CPU. 


Timer: A programmable peripheral that can be used to generate pulses or 
to time events. 


Timer-Period Register: A 32-bit memory-mapped register that specifies
the period for the on-chip timer.


trap vector table (TVT): An ordered list of addresses, each of which
corresponds to an interrupt; when a trap is executed, the processor executes
a branch to the address stored in the corresponding location in the trap
vector table.


trap vector table pointer (TVTP): A register (in the CPU expansion-register
file) that contains the address of the beginning of the trap vector table. 
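
The relationship between the trap vector table and its pointer can be sketched as a table lookup. The sketch below assumes, for illustration only, that memory is modeled as an array of 32-bit words, that each table entry occupies one word, and that the entry for trap n sits n words past the start of the table; the names MEM_WORDS and trap_target are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64    /* hypothetical memory size for this sketch */

/* Return the branch target for trap `n`: the address stored in the
   table entry located `n` words after the table start `tvtp`.
   Assumption: one word per entry, entries in ascending trap order. */
static uint32_t trap_target(const uint32_t *memory, uint32_t tvtp, uint32_t n)
{
    assert(tvtp < MEM_WORDS && n < MEM_WORDS - tvtp);
    return memory[tvtp + n];
}
```

In this model, moving the table only requires changing the pointer value; the trap numbers themselves stay the same.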


TVTP: See trap vector table pointer. 


unified mode: A mode of operation of the DMA coprocessor. The mode is 
used mainly for memory-to-memory transfers. This is the default mode 
of operation for a DMA channel. See also split mode. 


wait state: A period of time that the CPU must wait for external program, 
data, or I/O memory to respond when reading from or writing to that ex- 
ternal memory. The CPU waits one extra cycle for every wait state. 
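
Because each wait state adds one cycle, the cost of a run of external accesses is simple arithmetic. The following C sketch assumes a hypothetical helper, total_access_cycles, and a caller-supplied base cycle count per access; neither comes from the device documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total cycles for `accesses` external accesses when
   each access takes `base_cycles` plus one extra cycle per wait state. */
static uint64_t total_access_cycles(uint32_t accesses,
                                    uint32_t base_cycles,
                                    uint32_t wait_states)
{
    return (uint64_t)accesses * ((uint64_t)base_cycles + wait_states);
}
```

For instance, four accesses at one base cycle each with two wait states take 12 cycles in this model, compared with 4 cycles at zero wait states.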


wait-state generator: A program that can be modified to generate a limited 
number of wait states for a given off-chip memory space (lower program, 
upper program, data, or I/O). 


zero fill: Fill the low or high order bits with zeros when loading a number into 
a larger field. 
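
Zero fill of either end of a 32-bit field can be sketched in C as follows. The function names zero_fill_high and zero_fill_low are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep the low `width` bits of `value` in place and
   fill the high-order bits of the 32-bit result with zeros. */
static uint32_t zero_fill_high(uint32_t value, unsigned width)
{
    assert(width >= 1 && width <= 32);
    uint32_t mask = (uint32_t)((UINT64_C(1) << width) - 1);
    return value & mask;
}

/* Hypothetical helper: move the low `width` bits of `value` to the top of
   the 32-bit result and fill the low-order bits with zeros. */
static uint32_t zero_fill_low(uint32_t value, unsigned width)
{
    assert(width >= 1 && width <= 32);
    uint32_t mask = (uint32_t)((UINT64_C(1) << width) - 1);
    return (value & mask) << (32 - width);
}
```

Unlike sign extension, the result does not depend on the most significant bit of the original field.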


digital filters
FIR 6-7 
IIR. See IIR filters 
lattice 6-17 
dimensions 
12-pin header 11-18 
14-pin header 11-12 
mechanical, 14-pin header 11-12 
division 
floating point 3-9 
integer 3-9 
DMA 3-3, 3-6, 7-13, 7-14, 8-2
autoinitialization 7-6 
C-programming, examples 7-9 
example 7-11, 7-12 
interrupts, example 7-8 
split mode, autoinitialization 7-15 
split-mode 7-6 
unified mode 7-10 
DMA autoinitialization 7-7 
DMA channel, finished transfer 7-3 
DMA controller. See DMA coprocessor 
DMA coprocessor 
array initialization 7-4 
autoinitialization 7-7 
example 7-8 
definition A-3 
interrupts 7-4 
link-pointer register, example 7-7 
operation examples 7-4 
programming 7-4
programming hints 7-2 
split mode example 7-5 
transfer description 7-4 
DMA interrupt enable register (DIE), definition A-3 
DMA programming 7-2 
DMA transfer 7-4 
communication port 7-5 
documentation 10-3 
double precision, fixed point 3-17 
DP. See data-page pointer 
dual-access RAM, definition A-3 
DuPont connector 11-3 


EMU0/1
configuration 11-19, 11-21, 11-22 
emulation pins 11-18 
IN signals 11-18 
rising edge modification 11-21 
EMU0/1 signals 11-2, 11-5, 11-6, 11-11, 11-16
emulation 
JTAG cable 11-1 
timing calculations 11-6 to 11-7, 11-16 to 11-24 
emulator 
connection to target system, JTAG mechanical 
dimensions 11-12 to 11-24 
designing the JTAG cable 11-1 
emulation pins 11-18 
signal buffering 11-8 to 11-11 
target cable, header design 11-2 to 11-3 
emulator pod, JTAG timings 11-5 
extended precision registers 3-17 
extended-precision floating-point format, definition A-3
extended-precision register 2-2 
definition A-3 
external flag pins 4-3 
external interfacing 4-3 
example 4-3 
external interrupt, definition A-3 
external logic 4-12 
external ready generation 4-13 
fast devices, OR 4-12
fast Fourier transforms 3-6, 6-56 
DIF (decimation in frequency) 6-24 
DIT (decimation in time) 6-24, 6-42 to 6-54 
inverse 6-73 
fast Fourier transforms (FFT) 6-24 
benchmarks 6-87 
complex radix-2 DIF 6-26 
DIF (decimation in frequency) 6-27 to 6-31, 
6-33, 6-34 to 6-41 
DIT (decimation in time) 6-55 
DIT (decimations in time) 6-41 
DMA 6-24 
real radix-2 6-56 
theories, references 6-24 
twiddle factors 6-32 
twiddle table 6-41 
types of 6-24 
FFT. See fast Fourier transforms 
FIFO buffer, definition A-3 
filters 
adaptive 6-7 
digital. See digital filters 
example 6-10 
FIR 6-14 
See also FIR filters 
IIR 6-12 to 6-15 
See also IIR filters 
FIR filter 
adaptive 6-15 
benchmarks 6-8 
FIR filters 6-7, 6-14 
circular addressing 6-7 
example 6-7 
features 6-7 
FIX instruction 3-9
FLOAT instruction 3-9 
floating point 
conversion (to/from IEEE) 3-19 
formats 3-19 
IEEE 3-20 
pop and push 2-8 
floating-point, reciprocal 3-12 
example 3-16 
floating-point division 3-12 
floating-point number, inverse, example 3-14 
formats, floating point 3-19
forward lattice filter, example 6-19 
FRIEEE instruction 3-19 
fully-connected network 8-19 


GIE 2-11, 2-13 
global bus 4-3
control signals 4-11 
global memory interface. See memory interface 


half-word manipulation 3-4 
hardware interrupt, definition A-3 
header 

14-pin 11-2 

dimensions, 14-pin 11-2 
hexagonal grid 8-19 
hit, definition A-3 
hotline 10-3 


IACK, definition A-4
ICFULL interrupt, example 8-2 
ICRDY communication port 7-7 
ICRDY interrupt, example 8-2 
IEEE 1149.1 specification, bus slave device 
rules 11-3 
IEEE Customer Service, address 11-3 
IEEE standard 11-3 
IIE. See internal interrupt enable register 
IIF. See IIOF flag register 
IIOF flag register (IIF) 7-5 
definition A-4 
IIR filters 6-7, 6-9, 6-9 
benchmarks 6-10, 6-12 to 6-15 
index registers, definition A-4 
initialization, boot.asm 1-9 
initialization routine 1-6 
input port 8-16 
integer division 3-9 
example 3-11 
interface, SRAM 4-8 
two strobes 4-10 


interfaces 

external. See external interfacing 

parallel processing 8-18 

shared bus 4-22 
internal interrupt, definition A-4 
internal interrupt enable register, definition A-4 
interrupt, definition A-4 
interrupt acknowledge (IACK), definition A-4 
interrupt flag register 2-11 
interrupt programming, procedure 2-11 
interrupt service routine, INT2 2-13 
interrupt service routine (ISR), definition A-4 
interrupt vector table (IVT), definition A-4 
interrupt vector table pointer (IVTP), definition A-4 
interrupts 

communication port 8-3 

context switching 2-14 

context-switching 2-11 

DMA 7-4 

dual services, example 2-12 

example 3-2 

examples 2-11 

IVTP reset 2-12 

nesting 2-13 

NMI 2-11 

priorities 2-11 

programming 2-11 

service routines 2-11, 2-13 

software polling, example 2-11 

vector table 2-11 
inverse Fourier transform 6-24 
inverse lattice filter, example 6-18 
inverse of floating point 3-12 
ISR. See interrupt service routine (ISR) 
IVTP 2-12 

See also interrupt vector table pointer 
IVTP register 2-11 


JTAG 11-14 
JTAG emulator 
buffered signals 11-9 
connection to target system 11-1 to 11-24 
no signal buffering 11-8 
pod interface 11-4
jumps 2-4 
LA0-LA30, definition. See A0-A30
LAJ instruction 2-4, 5-5 
lattice filter structure 6-17 
lattice filters 6-17, 6-18 
applications 6-17 
benchmarks 6-20 
forward 6-19 
LBb, LBUb instructions 3-4
LD0-LD31, definition. See D0-D31
LHw, LHUw instructions 3-4 
linker command file 1-6 
example 1-9 
literature 10-3 
LMS algorithm 6-13 
local bus 4-3
control signals 4-11 
local memory interface. See memory interface 


local memory interface control register (LMICR), 
LSTRB ACTIVE field 4-9 

loop, delayed block repeat, example 2-19 

loop optimization, example 5-3 

loops 2-18 
single repeat 2-20 

LSB, definition A-5 

LWLct, LWRct instructions 3-4


machine cycle. See CPU cycle 
mantissa, definition A-5 
maskable interrupt, definition A-5 
matrix vector multiplication, data-memory organization 6-21
MBct, MHct instructions 3-4 
memory, object exchange, example 5-2 
memory device timing 4-6 
memory interface 4-12 

global 4-4 

local 4-4 

ready generation 4-11 

shared global 4-21 

strobes 4-7 

two banks 4-8 
wait states 4-11 
memory interface (local, global)
RAM (zero wait states) 4-7 
shared bus 4-22 


memory interface control registers 4-12 
LSTRB ACTIVE field 4-8 
PAGESIZE field 4-8, 4-18 


memory interfacing, introduction 4-1 
memory map 4-4 
memory-mapped register, definition A-5 


message broadcasting 8-20 
communication ports 8-21 


MFLOPS, definition A-5 


microcomputer mode, definition. See microprocessor mode
microprocessor mode, definition. See microcomputer mode
MIPS, definition A-5 
miss, definition A-5 
MPYI3 instruction 3-18 
MPYSHI3 instruction 3-18 
MSB, definition A-5 


mu-law 
compression, expansion 6-2 
conversion, linear 6-2 


multiplication, matrix vector 6-21 


multiplier, definition A-5 


networks 
distributed-memory 4-21 
parallel connectivity 8-18 


Newton-Raphson algorithm 3-12, 3-15 


NMI 2-13 
See also nonmaskable interrupt 


nomenclature 10-9 
nonmaskable interrupt (NMI), definition A-6 


normalization 3-15 
OCEMPTY interrupt, example 8-2
OCRDY interrupt, example 8-2 


operations 
examples 3-1 
introduction 3-1 
logical instructions 3-2 


output enable (OE) controls 4-5 


output modes 
external count 11-18 
signal event 11-18 


output port 8-15 
overflow flag (OV) bit, definition A-6 


packing data example 3-4 

page, switching 4-18 

page switching, example 4-19 

PAL 11-19, 11-20, 11-22 

parallel instruction set, optimization use 5-5 


parallel processing 
’C4x to ’C4x 8-20
distributed memory 8-19 
shared and distributed memory 8-19 
shared bus 4-22 
shared memory 4-21, 8-19 


part numbers 
device 10-11 
tools 10-12 


part-order information 10-9 

PC. See program counter 

peripheral bus, definition A-6 
phone numbers, manufacturer 10-6 
pipeline, definition A-6 

pipelined linear array 8-18 

PLD equations 4-16 

polling method, communication port 8-4 
POP instruction 2-7, 2-14 

POPF instruction 2-7 

port driver circuit, diagram 8-16 
primary channel 7-14 

processor, delays 4-5 


processor initialization 1-6 
C language 1-9 
example 1-7 
introduction 1-1 

product vector 6-21 


program control 
instructions 2-1 
introduction 2-1 
program counter, definition A-6 
programming tips 7-2 
introduction 5-1 
protocol, bus 11-3 
pulldown resistor 8-5 
pullups 1-5, 8-5 
PUSH instruction 2-7, 2-14 
PUSHF instruction 2-7 


queues (stack) 2-9 


R/W. See read/write pin
RAM, zero wait states 4-7 
RAMs 4-5, 4-8
RC. See repeat counter register 
RCPF instruction 3-9, 3-12 
readsync 7-13 
read/write (R/W) pin, definition A-6 
ready control logic 4-14
ready generation 4-11 
ready signals 4-12 
regional technology centers 10-5 
register file, definition A-6 
registers 

optimization use 5-5 

repeat count (RC) 2-20 

stack pointer (SP) 2-7 
regular subroutine call, example 2-3 
repeat count register (RC) 2-20 
repeat counter register, definition A-6 
repeat mode, definition A-6 
repeat modes, block repeat, restrictions 2-19 
reset
definition A-6 
multiprocessing 1-5 
rise/fall time 1-4 
signal generation 1-3 
vector locations 1-2 
vector mapping 1-2 
voltage 1-3 
reset circuit, diagram 1-3 
reset pin 
definition A-6 
voltage, diagram 1-4 
RETIcond instruction 2-13 
RETScond instruction 2-2 
ROMEN, definition A-6 
RPTB and RPTBD instructions 6-24 
optimization use 5-5 
RPTB instruction 2-18 
RPTBD instruction 2-18 
RPTS instruction 2-18 
example 2-18, 3-3 
optimization use 5-5 
RSQRF instruction 3-15, 3-16 
RTCs 10-5
run/stop operation 11-8 
RUNB, debugger command 11-18, 11-19, 11-20, 
11-21, 11-22 
RUNB_ENABLE, input 11-20 


scan path linkers 11-14 
secondary JTAG scan chain to an SPL 11-15 
suggested timings 11-21 
usage 11-14 

scan paths, TBC emulation connections for JTAG
scan paths 11-23


seminars 10-5 

serial resistors 8-5 

shared bus interface 4-22 

shared memory 4-21 

short floating point format, definition A-7 
short integer format, definition A-7 

short unsigned integer format, definition A-7 
signal descriptions, 14-pin header 11-2 
signal quality 8-5 
signals
buffered 11-9 


buffering for emulator connections 11-8 to 11-11 


description, 14-pin header 11-2 
timing 11-5 
sign-extend, definition A-7 


single-access RAM (SARAM), definition A-7 
single-precision floating-point format, definition A-7 
single-precision integer format, definition A-7 
single-precision unsigned-integer format, definition A-7
slave devices 11-3 
slow devices, OR 4-12 
sockets 10-6 
325-pin ’C40, 304-pin ’C44 10-6
software development tools 
assembler/linker 10-2 
C compiler 10-2 
digital filter design package 10-2 
general 10-12 
linker 10-2 
simulator 10-2 


software interrupt, definition A-7 
software polling, interrupts, example 2-11 
software stack 2-2, 2-11 
split mode, definition A-7 
split mode (DMA) 7-5 
split-mode 7-13, 7-14 
square root, calculation 3-15 
ST. See status register 
stack 2-7 
definition A-7 
stack pointer 2-7 
stack pointer (SP), application 2-7 
stacks 
growth 2-8 
high-to-low memory, diagram 2-9 
low-to-high memory, diagram 2-9 
user 2-8 
status register, definition A-7 
straight, unshrouded, 14-pin 11-3 
STRBx SWW 4-12
strobes 4-9 
wait states 4-7 
SUBB instruction 3-17, 3-18 
SUBC instruction 3-9
SUBI instruction 3-18 


subroutine 2-21
subroutines
calls. See calls


support tools
development 10-10
device 10-10


support tools nomenclature 10-9 
system configuration 4-2
possible 4-2


system configuration stack, diagram 2-8 
system initialization 1-3 
system stacks 2-7 

stack pointer 2-7 


target cable 11-12
target system, connection to emulator 11-1 to 11-24


target-system clock 11-10 


TCK signal 11-2, 11-3, 11-5, 11-6, 11-11, 11-15,
11-16, 11-23
TDI signal 11-2, 11-3, 11-4, 11-5, 11-6, 11-7, 11-10,
11-11, 11-16, 11-17


TDO output 11-3
TDO signal 11-3, 11-4, 11-6, 11-7, 11-17, 11-23


technical assistance 10-3 
test bus controller 11-20, 11-23 
test clock 11-10
diagram 11-10


third-party support 10-3 
Timer, definition A-7 
Timer Period Register, definition A-7 


timing 


bank switching 4-20 
page switching 4-20 
timing calculations 11-6 to 11-7, 11-16 to 11-24 


TMS, signal 11-3
TMS signal 11-2, 11-4, 11-5, 11-6, 11-7, 11-10,
11-11, 11-15, 11-16, 11-17, 11-23
TMS/TDI inputs 11-3
TOIEEE instruction 3-19 


token forcer 8-15
circuit, diagram 8-15


tools, part numbers 10-12

tools nomenclature 10-9 

transfer function 6-9 

trap vector table (TVT), definition A-8 

trap vector table pointer (TVTP), definition A-8 

tree structures 8-18 

TRST signal 11-2, 11-5, 11-6, 11-11, 11-15, 11-16, 
11-24 

TSTB instruction 3-2 

TVTP. See trap vector table pointer 


twiddle factor 6-32 
fast Fourier transforms (FFT) 6-41 


unified mode, definition. See split mode 
unpacking data example 3-5 


wait state, definition A-8 
wait states 4-5, 4-11, 4-15
consecutive reads, then write 4-6 
consecutive writes, then read 4-7 
full-speed 4-5 
logic 4-14 
memory device timing. See memory device timing
wait-state generator, definition A-8 
workshops 10-5 
write cycles, RAM requirements 4-6 


XDS510 emulator, JTAG cable. See emulation 


zero fill, definition A-8 
zero overhead subroutine call, example 2-5 


ZIF PGA socket 
handle-activated, diagram 10-8 
tool-activated, diagram 10-7 